Abstract. In randomized experiments, treatment and control groups 
should be roughly the same — balanced — in their distributions of pre- 
treatment variables. But how nearly so? Can descriptive comparisons 
meaningfully be paired with significance tests? If so, should there be 
several such tests, one for each pretreatment variable, or should there 
be a single, omnibus test? Could such a test be engineered to give eas- 
ily computed p-values that are reliable in samples of moderate size, or 
would simulation be needed for reliable calibration? What new con- 
cerns are introduced by random assignment of clusters? Which tests of 
balance would be optimal? 

To address these questions, Fisher's randomization inference is applied
to the question of balance. Its application suggests the reversal 
of published conclusions about two studies, one clinical and the other 
a field experiment in political participation. 

Key words and phrases: Cluster, contiguity, community intervention, 
group randomization, randomization inference, subclassification. 



1. INTRODUCTION 

In a controlled, randomized experiment, treatment 
and control groups should be roughly the same — 
balanced — in their distributions of pretreatment vari- 
ables. But how nearly so? Reports of clinical trials 
are urged to present tables of treatment and con- 
trol group means of x-variables (Campbell et al., 
2004), and they often do. These greatly assist qual- 
itative assessments of similarity and difference be- 
tween the groups, but in themselves they are silent 
as to whether, given the design, the discrepancies 
between the groups are large or small. Can the de- 
scriptive comparisons meaningfully be paired with 
significance tests? If so, must there be several, one 
for each variable, or can there be a single omnibus 
test? Would the omnibus test always require a sim- 
ulation experiment, as proposed at some places in 
the literature on random assignment by group (Raab 
and Butcher, 2001)? Is there a large-sample test that 
is reliable in samples of moderate size, notwithstand- 
ing recent evidence to the contrary about one nat- 
ural procedure (Gerber and Green, 2005)? At the 
level of foundations, some authors note that to as- 
sume experimental subjects to have been sampled 
from a superpopulation is antithetic to the nonparametric
spirit common to randomized trials, and increasingly
even to nonrandomized studies (Imai et al., 2008).
Does testing for balance require a 
superpopulation-sampling model, as these authors 
also claim, or are there tests that more narrowly 
probe data's conformity to the experimental ideal? 
Relatedly, tests based on differences of group means 
require precise instructions for combining differences 
across strata or blocks, with the optimal approach 
appearing to depend on within- and between-stratum 
variation in x — within- and between- variation in the 
population, not the sample (Kalton, 1968). Does not 
the fine-tuning of these instructions require assumptions
about, or estimation of, variability in the superpopulation,
introducing sources of uncertainty 
that are generally ignored when drawing inferences 
about treatment effects (Yudkin and Moher, 2001)? 
Without notional superpopulations of x-values, how 
are alternatives to the null hypothesis to be con- 
ceived? What tests are optimal against these alter- 
natives? 

The most familiar randomized comparisons of hu- 
man subjects, perhaps, are drug and vaccine stud- 
ies. Generally these are randomized at the level of 
individuals. But interventions upon neighborhoods, 
classrooms, clinics and families are increasingly the 
objects of study, and are increasingly studied exper- 
imentally; and even nonexperimental interventions 
at the group level may be analyzed using a combi- 
nation of poststratification and analogies with hy- 
pothetical experiments. Might it be safe to ignore 
the group structure [as outcome analyses of cluster- 
randomized data often do (MacLennan et al. (2003); 
Isaakidis and Ioannidis (2003)), in some conflict with 
the recommendations of methodologists (Gail et al. 
(1996); Murray (1998); Donner and Klar (2000))] if 
interest focuses on individual-level outcomes, if correlations
within group are low, or if the groups are 
small? Or, alternatively, do methods appropriate to 
individual-level assignment readily generalize to as- 
signment by group? 

1.1 Example: A Clinical Trial with 
Randomization at the Clinic Level 

In order to study the benefit of up-to-date, best 
practices in monitoring and treatment of coronary 
heart disease, the ASSIST trial randomized 14 of 21 
participating clinics to receive new systems for the 
regular review of heart disease patients 
(Yudkin and Moher, 2001). A primary outcome was 
whether monitoring assessments of heart patients 
met prescribed standards. One expects random as- 
signment to make treatment and control clinics com- 
parable in terms of what fractions of their heart 
patients were adequately assessed at baseline, and 
on baseline values of other relevant outcome vari- 
ables. As is evident from Table 1, however, the clin- 
ics varied greatly in size and in patient characteris- 
tics; these differences limit the power of coin-tossing 
to smooth over preexisting differences. Seemingly 



sizable differences between treatment and control 
groups' proportions of adequately assessed patients 
may still compare favorably with differences that 
would have obtained in alternate random assign- 
ments. Viewed in isolation, such differences would 
appear, misleadingly, to threaten comparability of 
intervention groups. A principled means of distin- 
guishing threatening and nonthreatening cases is 
needed. 

A related need is for metrics with which to ap- 
praise the likely benefit, in terms of balance, of ran- 
domizing within blocks of relative uniformity on base- 
line measures. 

1.2 Example: A Field Experiment on Political 
Participation 

A second case in point is A. Gerber and D. Green's 
Vote'98 campaign, a voter turnout intervention in 
which get-out-the-vote (GOTV) appeals were ran- 
domly assigned to households of 1 or 2 voters. This 
is cluster-level randomization, because members of 
two-voter households were necessarily assigned to 
the same intervention; but with clusters containing 
no more than two individuals, it is as close to ran- 
domization of subjects as randomization of clusters 
can get. Accordingly, Gerber and Green's (2000) re- 
port gave outcome analyses that ignored clustering, 
effectively assuming their treatment assignments to 
have been independent of subjects', rather than clus- 
ters', covariates, and finding that in-person appeals 
effectively stimulated voting whereas solicitations de- 
livered over the telephone, by professional calling 
firms, had little or no effect (Gerber and Green, 2000). 
Criticizing this analysis, Imai observes that the data 
Gerber and Green made available alongside their 
publication did not support the hypothesis of inde- 
pendence of subject-level covariates and treatment 
assignments (Imai, 2005). So poorly balanced are 
the groups, writes Imai, that the hypothesis of in- 
dependence can be rejected at the 10~^ level (Imai 
(2005), Table 6). Had experimental protocol broken 
down, effectively spoiling the random assignment? 
Imai deduces that it must have, dismissing the orig- 
inal analysis and instead mounting another upon 
very different assumptions. Contrary to Gerber and 
Green, Imai's revision attaches significant benefits 
to paid GOTV calls. 

In a pointed response, Gerber and Green (2005) 
shift doubt from the implementation of their ex- 
periment to Imai's methodology — particularly, the 
Table 1 

Sizes of a subset of the 21 clinics participating in the ASSIST trial of register and recall systems for heart disease patients, 
along with baseline measurements of primary and secondary outcome variables 

                     Numbers of coronary heart disease patients
                        Adequately               Treated with
Practice #   In total   assessed     aspirin   hypotensives   lipid-reducers
     3           38          6          30          17               6
     6           58         19          38          31              16
     9           91         23          60          56              22
    12          114         46          86          60              35
    15          127         58         103          86              30
    18          138         68         106          86              57
    21          244         93         181          93              63

Despite the great variation in practice sizes, and in practice benchmarks targeted for improvement, a balanced allocation of 
practices to treatment conditions was sought. Adapted from Yudkin and Moher (2001), Table II. 


method by which he checks for balance. Their coun- 
terattack has three fronts. First, they point out that 
Imai's analysis assumed independent assignment of 
individuals, whereas assignment really occurred at 
the household level. Second, they present results from 
a replication of the telephone GOTV experiment on 
a much larger scale, now randomizing individuals 
rather than households. The replication results were 
consistent with those of the original study. Third, 
they present simulation evidence that would cast 
doubt on Imai's recommended balance tests even 
had randomization been as he assumed. Those tests 
carried an asymptotic justification, for which the 
Vote'98 sample appears to have been too small — 
even though it comprised some 31,000 subjects, in 
more than 23,000 households! 

The manifold nature of this argument makes 
methodological lessons difficult to draw. If the con- 
clusion that the Vote'98 treatment assignment lacked 
balance is mistaken, then did the mistake lie in the 
conflation of household- and individual-level ran- 
domization, in the use of an inappropriate statistical 
test, or both? 

1.3 Structure of the Paper 

This and Section 2 introduce the paper. Section 3 
develops randomization's consequences for the adjusted
and unadjusted differences of group means on baseline
variables. Section 4 adapts these measures to 
testing for balance on several variables simultane- 
ously. Section 5 develops theoretical arguments for 
the optimality of a specific approach recommended 
in Sections 3 and 4, and for the setting of a tun- 
ing constant, while Section 6 illustrates uses of the 



methodology for design and analysis. Section 7 con- 
cludes. 

2. TWO WAYS NOT TO CHECK FOR 
BALANCE 

This section examines appealing but ad hoc adap- 
tations of two standard techniques, the method of 
standardized differences and goodness-of-fit testing 
with logistic regression, to the problem of testing for 
balance after random assignment of groups. To illus- 
trate, we use the rich and publicly available Vote'98 
dataset (Gerber and Green, 2005). It describes some 
31,000 voters, falling in about 23,000 households; 
to complement this unusually large randomized ex- 
periment with a smaller one, we consider a simple 
random subsample of 100 households, comprising 
133 voters. We study the association of the treat- 
ment assignment, z, with available covariates, x, 
including age, ward of residence, registration sta- 
tus at the time of the previous election, whether 
a subject had voted in that election, and whether 
he had declared himself a member of a major po- 
litical party. Telephone reminders to vote were at- 
tempted to roughly a fifth of the subjects, and it is 
around the putative randomness of this treatment 
assignment that Gerber, Green and Imai's debate 
centers. 

2.1 Blurring the Difference Between Units of 
Assignment and Units of Measurement 

Let us contrast measurement units, subjects or 
elements, here voters, with clusters or assignment 
units, here households containing one or two voters. 
The standardized difference of measurement units 
on x is a scaled difference of the average of x-values 
among measurement units in the treatment group 
and the corresponding average for controls. To facil- 
itate interpretation, the difference is scaled by the 
reciprocal of one s.d. of measurement x's, so that 
100 × (standardized difference) can be read as a percentage
of one s.d. The purpose of 
this scaling, which is common in the matching liter- 
ature (Cochran and Rubin (1973), page 420), is to 
standardize across x-variables; it differs from direct 
standardization of means or rates of disparate pop- 
ulations (cf., e.g., Fleiss (1973); Breslow and Day 
(1987)), in which subpopulations' means or rates are 
combined using a standard set of reference weights. 

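Setting the scaling aside, one has differences $\bar{x}_t - \bar{x}_c$,
or, in vector notation,
$\mathbf{z}^{\top}\mathbf{x}/\mathbf{z}^{\top}\mathbf{1} - (\mathbf{1}-\mathbf{z})^{\top}\mathbf{x}/(\mathbf{1}-\mathbf{z})^{\top}\mathbf{1}$,
where $\mathbf{z} \in \{0,1\}^{m}$ indicates assignment to the
treatment group. Considering this difference as a random variable,
$\mathbf{Z}^{\top}\mathbf{x}/\mathbf{Z}^{\top}\mathbf{1} - (\mathbf{1}-\mathbf{Z})^{\top}\mathbf{x}/(\mathbf{1}-\mathbf{Z})^{\top}\mathbf{1}$,
and conditioning on the numbers of measurement units in the treatment
and control groups, $m_t = \mathbf{Z}^{\top}\mathbf{1}$ and
$m_c = (\mathbf{1}-\mathbf{Z})^{\top}\mathbf{1}$, makes it a shifted random sum:

$$\frac{\mathbf{Z}^{\top}\mathbf{x}}{m_t} - \frac{(\mathbf{1}-\mathbf{Z})^{\top}\mathbf{x}}{m_c} = \mathbf{Z}^{\top}\mathbf{x}/h - \mathbf{1}^{\top}\mathbf{x}/m_c, \tag{1}$$

where $h = (m_t^{-1} + m_c^{-1})^{-1}$ is half of the harmonic
mean of $m_c$ and $m_t$. Were treatment-group measurement units
a simple random subsample of the sample as a whole, basic theory
of simple random sampling would imply that (1) has mean zero and
variance equal to $(m_t m_c/m)^{-1} s^2(\mathbf{x}) = s^2(\mathbf{x})/h$,
for $s^2(\mathbf{x}) = (m-1)^{-1}\sum_i (x_i - \bar{x})^2$ and $m = m_t + m_c$.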
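
The mean and variance claimed for (1) can be checked by brute force on a tiny example. The sketch below is illustrative only: the covariate values are hypothetical, the function names are ours, and it enumerates every possible treatment group under simple random sampling of measurement units (no clusters).

```python
from itertools import combinations
from statistics import mean, variance


def difference_of_means(treated, x):
    """Treatment-minus-control difference of means for a set of treated indices."""
    treated = set(treated)
    x_t = [v for i, v in enumerate(x) if i in treated]
    x_c = [v for i, v in enumerate(x) if i not in treated]
    return mean(x_t) - mean(x_c)


def exact_moments(x, m_t):
    """Mean and variance of the difference over all size-m_t treatment groups."""
    diffs = [difference_of_means(c, x) for c in combinations(range(len(x)), m_t)]
    mu = mean(diffs)
    var = mean([(d - mu) ** 2 for d in diffs])
    return mu, var


# Hypothetical covariate values for m = 6 measurement units, m_t = 2 treated.
x = [3.0, 7.0, 1.0, 4.0, 9.0, 2.0]
m_t, m_c = 2, 4
h = 1.0 / (1.0 / m_t + 1.0 / m_c)   # half the harmonic mean of m_t and m_c

mu, var = exact_moments(x, m_t)
theory_var = variance(x) / h        # variance() uses the (m - 1) divisor
```

Here `mu` should vanish and `var` should agree with `theory_var` up to rounding, since all equally likely assignments are enumerated rather than sampled.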

Consider instead the case in which a treatment 
group is selected by drawing a simple random sam- 
ple of clusters of measurement units, but the anal- 
ysis adopts the simplifying pretense that the group 
assigned to treatment constitutes a simple random 
sample of measurement units themselves. With this 
"fudge," differences $\bar{x}_t - \bar{x}_c$ are readily converted 
to z-scores. In the debate described in Section 1.2 
above, both Gerber and Green (2000) and Imai (2005) 
took such an approach, perhaps reasoning that with 
cluster sizes no larger than two, differences between 
cluster- and individual-level randomization should 
be inconsequential. 

We mounted a simulation experiment to determine whether this is so.
The simulation mimicked the structure of the experiment's actual
design, forming simulated treatment groups from random samples of
5275 of the 23,450 households, calculating differences $d_x^*$ in
means of measurement unit x-values in the simulated treatment and
control groups, and comparing these differences to the analogous
difference $d_x$ between subjects to whom the Vote'98 campaign did
and did not attempt a GOTV call. It reshuffled the treatment group
$10^6$ times, making simulation p-values accurate to within 0.001.
These p-values are given in the third and sixth columns of Table 2,
which also presents p-values corresponding to the z-scores discussed
above (p-values which ignore clustering) as well as large-sample
p-values that account for clustering (by the method of Section 3.1,
which attends to the difference of means of clusters' aggregated
x-values rather than the difference of individuals' mean x-values).
All p-values in Table 2 are two-sided.
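
A simulation of this kind can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the household data are made up, the function names are hypothetical, and whole households are reshuffled with the number of treated households held fixed.

```python
import random


def diff_of_means(clusters, treated):
    """Difference of individual-level means; `clusters` holds one list of x-values per household."""
    x_t = [v for c, z in zip(clusters, treated) if z for v in c]
    x_c = [v for c, z in zip(clusters, treated) if not z for v in c]
    return sum(x_t) / len(x_t) - sum(x_c) / len(x_c)


def cluster_randomization_pvalue(clusters, treated, reps=1000, seed=0):
    """Two-sided simulation p-value obtained by reassigning whole clusters."""
    rng = random.Random(seed)
    observed = abs(diff_of_means(clusters, treated))
    n, n_t = len(clusters), sum(treated)
    at_least_as_extreme = 0
    for _ in range(reps):
        chosen = set(rng.sample(range(n), n_t))      # n_t treated clusters
        z_star = [i in chosen for i in range(n)]
        # small tolerance so that ties are not lost to rounding
        if abs(diff_of_means(clusters, z_star)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps


# Hypothetical data: 1 = voted in the previous election, households of 1 or 2 voters.
households = [[1], [0, 1], [1, 1], [0], [1], [0, 0], [1, 0], [1]]
assigned = [True, False, True, False, False, False, True, False]
p_sim = cluster_randomization_pvalue(households, assigned, reps=2000, seed=1)
```

With `reps` replicates the Monte Carlo standard error of the p-value is at most `0.5 / reps ** 0.5`, which is the kind of calculation that determines how many reshuffles a stated accuracy requires.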

The approximation ignoring the clustered nature 
of the randomization is not particularly good, especially
for m = 133. Its p-values differ erratically from 
the actual p-values, at some points incorrectly sug- 
gesting departures from balance and elsewhere exag- 
gerating it. (We had expressed the nominal "Ward" 
variable as 29 indicator variables, one for each ward, 
and the age measurement in terms of cubic B-splines 
with knots at quintiles of the age distribution, to 
yield six new x-variables; Table 2 displays the four of 
the 29 ward indicators, and the four of the six spline 
basis variables, for which the approximate p-values 
ignoring groups were most and least discrepant from 
actual p-values in the subsample and the full sam- 
ple.) Increasing the sample size from 133 to 31,000 
appears to improve the approximation somewhat, 
but not nearly as much as does explicitly account- 
ing for clustering. It is noteworthy that pretending 
assignment was at the individual level leads to such 
striking errors — even with only half the experimen- 
tal subjects assigned as part of a cluster, and even 
with no clusters larger than two. 

2.2 The p-Value from Logistic Regression of 
Treatment Assignment on x's 

With or without treatment assignment by clus- 
ters, and with or without analytic adjustments to ac- 
count for clusters, the method of standardized differ- 
ences has the limitation that it produces a long list 
of nonindependent p-values, one for each x-variate 
studied. In many settings just a few p-values, ide- 
ally one, would be more convenient. This is true 
both when appraising the integrity of a randomization
procedure, as Imai (2005) sought to do, and when
poststratifying an observational study with the goal of 
creating poststrata that resemble blocks of a randomized
study in terms of observed covariates: for 
appraising such a poststratification, a list of possi- 
bly correlated test statistics is less helpful than a 
single omnibus test. 

Logistic regression seems well suited to these tasks, 
particularly when treatment has been assigned at 
the measurement-unit level. For simple randomization,
regress treatment assignment, z, on covariates 
x and a constant, then on the constant alone, and 
compare the two fits using a standard asymptotic 
likelihood-ratio test. This one test speaks to whether 
x-variables influence Z, allowing each of the covariates
to contribute to its verdict. Should the asymptotics
of this deviance test apply, it will reject (at 
the 0.05 level) no more than about 5% of treat- 
ment assignments, presumably the ones in which, 
by coincidence, covariate balance failed to obtain. 
(For block randomization, the analogous approach 
involves regressing z on x's and a separate constant 
for each block, then on those constants alone.) There 
are problems with this procedure, however. Sample- 
size requirements are more stringent than one might 
think, are difficult to ascertain, and are typically in- 
compatible with checking for balance thoroughly. 
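
For concreteness, the deviance test just described can be sketched for the simplest case of a single covariate, where the likelihood-ratio statistic is referred to a chi-square distribution on one degree of freedom. This is a hand-rolled illustration under stated assumptions (one x-variable, simple randomization, plain Newton-Raphson without safeguards, made-up data); it is not the procedure's reference implementation, and the cautions above about its asymptotics apply to it in full.

```python
import math


def fit_logistic(z, x, iterations=25):
    """Newton-Raphson fit of logit P(z = 1) = b0 + b1 * x (no step-size safeguards)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, xi in zip(z, x):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += zi - p                 # score for the constant
            g1 += (zi - p) * xi          # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(z, x, b0, b1):
    total = 0.0
    for zi, xi in zip(z, x):
        eta = b0 + b1 * xi
        # log(1 + exp(eta)), written to avoid overflow
        total += zi * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total


def deviance_test(z, x):
    """Likelihood-ratio statistic and asymptotic p-value (chi-square, 1 d.f.)."""
    n, n1 = len(z), sum(z)
    p_hat = n1 / n
    ll_null = n1 * math.log(p_hat) + (n - n1) * math.log(1.0 - p_hat)
    b0, b1 = fit_logistic(z, x)
    stat = max(2.0 * (log_likelihood(z, x, b0, b1) - ll_null), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical assignment z and covariate x for eight units.
z_demo = [0, 1, 0, 0, 1, 1, 0, 1]
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
stat_demo, p_demo = deviance_test(z_demo, x_demo)
```

With several covariates the same comparison of fits applies, with degrees of freedom equal to the number of x-variables; a general-purpose fitting routine would then replace the two-parameter Newton step used here.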

Table 3 shows the small-sample performance of 
the logistic regression deviance test, presenting the 
actual sizes of asymptotic-level 0.001, 0.01, 0.05 and 
0.10 tests as applied to assignments of 14 of Yudkin 
and Moher's 21 clinics to treatment. The test's Type 
I error rates are markedly too high. Perhaps poor 
performance of asymptotic tests is to be expected, 
given the small sample size; but it is noteworthy 



Table 3 

Small-sample (n = 21) Type I error rates of two types of 
test, one based on logistic regression and another, to be 
described in Section 4, based on adjusted differences of 
treatment and control groups' covariate means 

                                          Size of test
                                 Asymptotic
Method                            0.001     0.01      0.05     0.10
                                 Actual
Logistic regression-based         0.0281    0.0620    0.16     0.24
Combined baseline differences     0.0000    0.0003    0.018    0.064


The actual size of the logistic regression tests well exceeds 
their nominal levels, while the alternate test is somewhat con- 
servative but holds to advertised levels. Based on 10^ simu- 
lated assignments to treatment of 14 of the 21 ASSIST clinics. 



that another asymptotic test, Section 4's method of 
combined baseline differences, succeeds in maintain- 
ing sizes no greater than advertised levels of signifi- 
cance. 

Figure 1 illustrates the limited accuracy of the lo- 
gistic regression approach in samples of moderate 
size. It compares asymptotic and actual null dis- 
tributions of p-values from the logistic regression 
deviance test, effecting the actual distribution by 
simulation. One thousand simulation replicates are 
shown, both for the 100-household Vote'98 subsample
and for the full sample. The covariates 
are those described in Section 2.1, with x-values for 
two-person households determined by summing x- 
values of individuals in each household. 

While p-values based on the asymptotic approx- 
imation appear accurate for the full sample, with 



Table 2 

Effect of accounting for assignment by groups on approximations to p-values, in the full Vote'98 sample and in a subsample 
of 100 households 

                                 100 households (m = 133)     All households (m = 31K)
                                 Accounting for groups?       Accounting for groups?
Baseline variable (x)             No      Yes     Actual       No      Yes     Actual
Number of voters in household     0.12    0.24    0.21         0.85    0.82    0.82
Voted in 1996                     0.40    0.22    0.22         0.23    0.39    0.39
Major party member                0.45    0.14    0.16         0.24    0.18    0.18
Bspline_2 (Age)                   0.68    0.59    0.60         0.06    0.31    0.31
Bspline_4 (Age)                   0.82    0.39    0.40         0.68    0.68    0.68
Bspline_5 (Age)                   0.72    0.24    0.24         0.39    0.22    0.22
Bspline_6 (Age)                   0.19    0.62    0.62         0.56    0.89    0.89
Ward 2                            1.00    1.00    0.50         0.81    0.87    0.87
Ward 5                            0.89    0.98    0.89         0.44    0.47    0.48
Ward 10                           0.58    0.54    0.65         0.95    0.97    0.97
Ward 11                           0.75    0.92    0.87         0.27    0.42    0.42

[Figure 1: two panels, each with horizontal axis "Unif[0,1] quantiles".]

Fig. 1. Theoretical and actual p-values of two omnibus tests of covariate balance, both accounting for clustering. With 100 
assignment units and 38 degrees of freedom (dark trace), logistic regression's p-values are markedly too small, whereas p-values 
from the method of combined baseline differences (Section 4) err toward conservatism, and to a lesser degree. The dashed lines 
at left indicate that the logistic regression-based test yielded p-values less than 0.05 in just under 40% of simulated random 
assignments, whereas the dashed lines at right indicate that the combined baseline difference-test with nominal level α = 0.05 
had actual size of about 0.01. However, with the full 23,000 assignment units (and the same 38 degrees of freedom), both 
methods perform as their asymptotics would predict, as indicated by the close agreement in both panels of the lighter traces 
and the 45° lines. 



its 23 thousand-someodd households, those for the 
subsample are quite exaggerated. In it, the nomi- 
nal 0.05-level test has an actual size of about 0.37. 
Would an alert applied statistician have identified 
the 100-household subsample as too small for the 
likelihood ratio test? Perhaps; it has only 2½ times 
as many observations as x-variables, once the Age 
and Ward variables have been expanded as in Ta- 
ble 2. But how large a ratio of observations to co- 
variates would be sufficient? Intuition may be a poor 
guide. To explore the difference in information car- 
ried by binary and continuous outcomes, Brazzale, 
Davison and Reid (2006, Section 4.2; see also Davison 
(2003), ex. 10.17) construct artificial data sets from 
a real one with a binary independent variable, some 
retaining the binary outcome structure but increas- 
ing the apparent information in the dataset by repli- 
cating observations, and others imputing continu- 
ous outcomes according to a logistic distribution. 
Their results are striking; one observation with con- 
tinuous response carries about as much information 
as eight observations with binary response, and de- 
viance tests are found to be unreliable even with 
11 times as many observations as x-variables. Har- 
rell (2001, Section 4.5), Peduzzi et al. (1996) and 
Whitehead (1993) offer somewhat less pessimistic 
guidelines, but even these would require well more 
than 10 times as many observations as x-variables — 
an odd condition to place on a comparative study, 
one which many otherwise strong studies would violate.

For contrast, the right panel of Figure 1 offers an 
analogous comparison between asymptotically ap- 
proximate and actual p-values of a test statistic to be 
introduced in Section 4. Even with relatively few ob- 
servations as compared to x-variables, its size never 
exceeds its nominal level (if it errs somewhat toward 
conservatism). 

3. RANDOMIZATION TESTS OF BALANCE, 
WITH AND WITHOUT CLUSTERS 

A common form of frequentism, sometimes traced 
to Neyman (1923), posits that subjects arrive in 
a study through random sampling from a broader 
population, and takes as its goal to articulate char- 
acteristics of that population. An impediment to ap- 
plying this conceptualization to comparative studies 
is that their samples need not represent background 
populations. Comparing within a sample and ex- 
trapolating from it are separate goals, neither of 
which needs to depend on the other. In contrast, 
in Fisher's model of a comparative study no back- 
ground population is supposed, but randomization 
is supposed to govern division of the sample into 
comparison groups. Inference asks what differences 
between groups can be explained by chance, rather 
than what differences between sample and population
can be explained by chance. Fisher's approach 
is better suited to appraising balance. 

3.1 Experiments with Simple Randomization of 
Clusters 

To illustrate, consider the question of whether in 
the Vote'98 experiment subjects assigned to receive 
a telephone reminder had voted in the prior election 
in similar proportion to those not so assigned. Since 
past voting is predictive of future voting, sizable dif- 
ferences to the advantage of either group may cause 
estimates of treatment effects to err, reflecting the 
baseline difference more than effects of GOTV in- 
terventions. 

Let the index $i = 1, \ldots, n$ run over assignment units, so that
$z_i$ indicates the treatment assignment of the $i$th cluster of
observation units. Interpret $x_i$ as the total of x-values for
observation units in cluster $i$, in this case the number of subjects
in the household who voted in the previous election, and let $m_i$ be
the size of that cluster, here 1 or 2, in observation units.
$\mathbf{z}$, $\mathbf{x}$ and $\mathbf{m}$ are $n$-vectors recording
these data for each assignment unit. The observed difference of the
proportions of treatment and control group subjects who had cast votes
in 1996 can be written as a function $d_p(\mathbf{z},\mathbf{x})$ of
the treatment-group indicator vector $\mathbf{z}$ and indicators
$\mathbf{x}$ of voting in the previous election. In symbols,
$d_p(\mathbf{z},\mathbf{x}) = \mathbf{z}^{\top}\mathbf{x}/\mathbf{z}^{\top}\mathbf{m} - (\mathbf{1}-\mathbf{z})^{\top}\mathbf{x}/(\mathbf{1}-\mathbf{z})^{\top}\mathbf{m}$;
for general measurement variables $\mathbf{v}$,
$d_p(\mathbf{z},\mathbf{v})$ is the difference of treatment and
control group means. Let $A$ be the set of treatment assignments from
which the actual assignment $\mathbf{z}$ was randomly selected; for
each member $\mathbf{z}^*$ of $A$, it is straightforward to compute
the amount $d_p(\mathbf{z}^*,\mathbf{x})$ by which treatments and
controls would have differed had assignment $\mathbf{z}^*$ been
selected. A (two-sided) randomization p-value attaching to the
hypothesis of nonselection on x is

$$\begin{aligned}
&\frac{\#\{\mathbf{z}^* \in A : |d_p(\mathbf{z}^*,\mathbf{x})| > |d_p(\mathbf{z},\mathbf{x})|\} + (1/2)\,\#\{\mathbf{z}^* \in A : |d_p(\mathbf{z}^*,\mathbf{x})| = |d_p(\mathbf{z},\mathbf{x})|\}}{\#A} \\
&\qquad = P(|d_p(\mathbf{Z},\mathbf{x})| > |d_p(\mathbf{z},\mathbf{x})|) + \tfrac{1}{2} P(|d_p(\mathbf{Z},\mathbf{x})| = |d_p(\mathbf{z},\mathbf{x})|),
\end{aligned} \tag{2}$$

where $\mathbf{Z}$ is a random vector that is uniformly distributed on
possible treatment assignments $A$. (Weighting by one-half those
$\mathbf{z}^* \in A$ for which
$|d_p(\mathbf{z}^*,\mathbf{x})| = |d_p(\mathbf{z},\mathbf{x})|$ makes
this a mid-p value, the null distribution of which will be more nearly
uniform on [0, 1] than would a p-value without this weighting. Agresti
and Gottard (2005) discuss merits of the mid-p value.) This appraisal
of balance on x does involve probability, but only treatment
assignment, not the covariate, is modeled as stochastic.
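
When the number of clusters is small, the mid-p value in (2) can be computed by direct enumeration. The following sketch assumes, for illustration, that the set A consists of all assignments placing the observed number of clusters in the treatment group; the function names and the toy data are hypothetical.

```python
from itertools import combinations


def d_p(z, x, m):
    """Difference of treatment and control means; z, x, m are cluster-level lists."""
    x_t = sum(xi for zi, xi in zip(z, x) if zi)
    m_t = sum(mi for zi, mi in zip(z, m) if zi)
    return x_t / m_t - (sum(x) - x_t) / (sum(m) - m_t)


def mid_p_value(z, x, m, tol=1e-12):
    """Two-sided randomization mid-p value over all assignments with sum(z) treated clusters."""
    n, n_t = len(z), sum(z)
    observed = abs(d_p(z, x, m))
    greater = equal = total = 0
    for chosen in combinations(range(n), n_t):
        picked = set(chosen)
        z_star = [1 if i in picked else 0 for i in range(n)]
        value = abs(d_p(z_star, x, m))
        if value > observed + tol:
            greater += 1
        elif abs(value - observed) <= tol:
            equal += 1              # ties receive weight one-half
        total += 1
    return (greater + 0.5 * equal) / total


# Hypothetical households: m = household size, x = number who voted previously.
m_demo = [1, 1, 2, 2, 1, 2]
x_demo = [1, 0, 2, 1, 0, 1]
z_demo = [1, 0, 1, 0, 0, 0]
p_mid = mid_p_value(z_demo, x_demo, m_demo)
```

Because the observed assignment is itself a member of A, it always contributes a tie, so the mid-p value computed this way is strictly positive and at most 1 - 1/(2 #A).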

In principle, these p-values can be determined ex- 
actly, perhaps by enumeration; in practice, it is ac- 
curate enough, and often much easier, to evaluate 
them by simulation [as does, e.g., Lee (2006)]. Under 
favorable designs, fast and accurate Normal approx- 
imations are also available. Consider first the case in 
which 

(A) the assignment scheme allocates a fixed and 
predetermined number $n_t$ of the $n$ clusters to 
treatment, and 

(B) each cluster contains the same number $m_0$ of 
measurement units. 

Then the ratios $\mathbf{Z}^{\top}\mathbf{x}/\mathbf{Z}^{\top}\mathbf{m}$
and $(\mathbf{1}-\mathbf{Z})^{\top}\mathbf{x}/(\mathbf{1}-\mathbf{Z})^{\top}\mathbf{m}$
of which $d_p(\mathbf{Z},\mathbf{x})$ is a difference have constants,
respectively, $k_0 = m_0 n_t$ and $k_1 = m_0 n_c$ (with $n_c = n - n_t$),
as denominators, so that, as in (1), $d_p(\mathbf{Z},\mathbf{x})$ has
an equivalent of the form
$\mathbf{Z}^{\top}\mathbf{x}/k - \mathbf{1}^{\top}\mathbf{x}/k_1$,
with $k = (k_0^{-1} + k_1^{-1})^{-1}$. Then it is necessary only to
approximate the distribution of $\mathbf{Z}^{\top}\mathbf{x}$, an
easier task than approximating the distribution of its ratio with
another random variable. Indeed, if $\{i \in \{1,\ldots,n\} : Z_i = 1\}$
is a simple random sample of size $n_t$, then
$\mathbf{Z}^{\top}\mathbf{x}$ is simply the sample sum of a simple
random sample of $n_t$ from $n$ cluster totals $x_1, \ldots, x_n$.
Common results for simple random sampling give that
$E(\mathbf{Z}^{\top}\mathbf{x}) = n_t \bar{x} = \frac{n_t}{n}\sum_i x_i$;
that $\mathrm{Var}(\mathbf{Z}^{\top}\mathbf{x}) = n_t(1 - n_t/n)\,s^2(\mathbf{x})$,
where $s^2(\mathbf{x}) = \sum_i (x_i - \bar{x})^2/(n-1)$; and that if
$\mathbf{x}$ has few or no outliers and is not particularly skewed,
then if $n$ is sufficiently large and $n_t/n$ is neither near 0 nor 1,
the law of $\mathbf{Z}^{\top}\mathbf{x}$ will be roughly Normal.
[Formally, if $n$ grows to infinity while $n_t/n$ approaches a
constant in (0,1), and mean squares and cubes of $|x|$ remain bounded,
then the limiting distribution of $\mathbf{Z}^{\top}\mathbf{x}$ is
Normal (Hájek (1960); Erdős and Rényi (1959)).] Over and above this
finite population central limit theorem, Höglund's Berry-Esseen
principle for simple random sampling (Höglund (1978)) limits the error
of the Normal approximation in finite samples, suggesting that it
should govern $\mathbf{Z}^{\top}\mathbf{v}$ similarly well for well
behaved covariates $\mathbf{v}$ other than $\mathbf{x}$, and that it
should be quite good even in samples of moderate size. Note that
covariates $\mathbf{x}$ which are ill-behaved, in the sense of being
skewed or having extreme outliers, are also ill-suited to be
summarized in terms of their means in any event. Thus transformations
to more regular covariates $\tilde{\mathbf{x}} = f(\mathbf{x})$ are
advisable in order to ease description, regardless of differences
between $d_p(\mathbf{Z},\mathbf{x})$'s and
$d_p(\mathbf{Z},\tilde{\mathbf{x}})$'s sampling properties; and insofar
as $\bar{\tilde{x}}$ appropriately measures the central tendency of
$(\tilde{x}_i : i \le n)$, $\tilde{\mathbf{x}}$ will be well-behaved in
the sense needed for $\mathbf{Z}^{\top}\tilde{\mathbf{x}}$ to be
approximately
$N(n_t\bar{\tilde{x}},\, n_t(1 - n_t/n)\,s^2(\tilde{\mathbf{x}}))$.
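
Under conditions (A) and (B) the Normal approximation reduces to a z-score for the treated-cluster total. The sketch below is an illustration only, with hypothetical function and variable names; it uses the simple-random-sampling mean and variance quoted above and converts the z-score to a two-sided p-value.

```python
import math
from statistics import mean, variance


def normal_approx_pvalue(z, x):
    """z-score and two-sided Normal p-value for the treated-cluster total of x.

    Assumes a fixed number of treated clusters and equal cluster sizes,
    so that the difference of means is a linear function of this total.
    """
    n, n_t = len(z), sum(z)
    observed = sum(xi for zi, xi in zip(z, x) if zi)
    expected = n_t * mean(x)
    var = n_t * (1.0 - n_t / n) * variance(x)   # variance() divides by n - 1
    score = (observed - expected) / math.sqrt(var)
    return score, math.erfc(abs(score) / math.sqrt(2.0))


# Hypothetical cluster totals for n = 10 clusters, n_t = 4 treated.
x_totals = [2, 0, 1, 1, 2, 0, 1, 2, 1, 0]
z_assign = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
score, p_normal = normal_approx_pvalue(z_assign, x_totals)
```

For a sample this small the approximation is only indicative; the enumeration or simulation p-value remains the benchmark against which it would be judged.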

Cases in which (A) or (B) fails might appear to frustrate this
argument. For instance, suppose treatment were assigned, in violation
of (A), by $n$ independent Bernoulli($p$) trials. Then there would be
some random fluctuation in treatment and control group sizes
$\mathbf{Z}^{\top}\mathbf{m}$ and $(\mathbf{1}-\mathbf{Z})^{\top}\mathbf{m}$,
and the denominators of the ratios of which $d_p(\mathbf{Z},\mathbf{x})$
is a difference would no longer be constants, so that the argument by
which Höglund's Berry-Esseen principle bounded the error of the Normal
approximation would no longer be available. However, this particular
frustration is circumvented by referring observed differences
$d_p(\mathbf{z},\mathbf{x})$ to conditional, rather than marginal,
distributions of $d_p(\mathbf{Z},\mathbf{x})$. For conditional on
$\mathbf{Z}^{\top}\mathbf{1} = \mathbf{z}^{\top}\mathbf{1} = n_t$,
condition (A) is restored, and provided (B) also holds the
distribution of $d_p(\mathbf{Z},\mathbf{x})$ is close to Normal, with
mean and variance as previously indicated.

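What of departures from (B), that is, clusters that vary in size? Here
the representation of $d_p(\mathbf{Z},\mathbf{x})$ as a linear
transformation of $\mathbf{Z}^{\top}\mathbf{x}$ need not apply, even
after conditioning on the number of clusters selected for treatment,
since then the number of treatment-group subjects
$\mathbf{Z}^{\top}\mathbf{m}$ may vary between possible assignments. A
modification to $d_p(\cdot,\cdot)$ circumvents the problem. Now writing
$m_t$ for the expected, rather than observed, number of measurement
units in the treatment group, set

$$\begin{aligned}
d(\mathbf{z},\mathbf{x}) &:= \frac{\mathbf{z}^{\top}\mathbf{x}}{m_t} - \frac{(\mathbf{1}-\mathbf{z})^{\top}\mathbf{x}}{m - m_t}
 && [m_t := E(\mathbf{Z}^{\top}\mathbf{m}),\ m = \mathbf{1}^{\top}\mathbf{m}] \\
 &= \bar{m}^{-1}\left[\mathbf{z}^{\top}\mathbf{x}/h - \mathbf{1}^{\top}\mathbf{x}/(n - n_t)\right]
 && [h := n_t(1 - n_t/n)],
\end{aligned}$$

where $\bar{m} = m/n$ is the mean cluster size.

Kerry and Bland (1998) recommend an analogous statistic for outcome
analysis in cluster randomized trials.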
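
To make the distinction concrete, the sketch below computes both the observed-denominator difference d_p and the modified difference d on toy data. It assumes simple randomization of a fixed number of clusters, so that the expected treated size is n_t times the mean cluster size; function names and data are hypothetical.

```python
def d_p(z, x, m):
    """Difference of means using the observed group sizes as denominators."""
    x_t = sum(xi for zi, xi in zip(z, x) if zi)
    m_t_obs = sum(mi for zi, mi in zip(z, m) if zi)
    return x_t / m_t_obs - (sum(x) - x_t) / (sum(m) - m_t_obs)


def d(z, x, m):
    """Modified difference: denominators are expected, not observed, group sizes."""
    n, n_t = len(z), sum(z)
    m_total = sum(m)
    m_t = n_t * m_total / n        # E(Z'm) under simple randomization of clusters
    x_t = sum(xi for zi, xi in zip(z, x) if zi)
    return x_t / m_t - (sum(x) - x_t) / (m_total - m_t)


def d_linear_form(z, x, m):
    """Same quantity written as a linear function of the treated total z'x."""
    n, n_t = len(z), sum(z)
    m_bar = sum(m) / n
    h = n_t * (1.0 - n_t / n)
    x_t = sum(xi for zi, xi in zip(z, x) if zi)
    return (x_t / h - sum(x) / (n - n_t)) / m_bar


# Hypothetical clusters of unequal size.
m_sizes = [1, 1, 2, 2]
x_sums = [1, 0, 2, 1]
z_small = [1, 1, 0, 0]             # the two one-person clusters are treated
gap = d(z_small, x_sums, m_sizes) - d_p(z_small, x_sums, m_sizes)
```

In this example the treated clusters are smaller than average, so the two statistics disagree; with equal cluster sizes they coincide, which is the situation covered by condition (B).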

In designs with size variation among assignment units, $d(\mathbf{z},\mathbf{x})$
and $d_p(\mathbf{z},\mathbf{x}) = \mathbf{z}^t\mathbf{x}/\mathbf{z}^t\mathbf{m} - (\mathbf{1}-\mathbf{z})^t\mathbf{x}/(m - \mathbf{z}^t\mathbf{m})$
may differ. The differences will tend to be small, particularly if
$\mathbf{m}$, now regarded as a covariate, is well balanced; and of course this
balance is expediently measured using $d(\mathbf{z},\mathbf{m})$ and its
associated p-value.

These considerations recommend $d(\mathbf{z},\mathbf{x})$ as a basic
measure of balance on a covariate $\mathbf{x}$.
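To make the distinction between the two statistics concrete, here is a minimal Python sketch for a single unstratified design. The data and the function names `d_p` and `d` are illustrative assumptions, not taken from the studies discussed; the only formulas used are the two definitions above.

```python
def d_p(z, x, m):
    """Difference of observed per-unit means: z'x / z'm - (1-z)'x / (m - z'm)."""
    zx = sum(zi * xi for zi, xi in zip(z, x))
    zm = sum(zi * mi for zi, mi in zip(z, m))
    return zx / zm - (sum(x) - zx) / (sum(m) - zm)


def d(z, x, m):
    """Modified difference: denominators use the expected size m_t = E(Z'm)."""
    n, n_t = len(z), sum(z)
    m_t = n_t * sum(m) / n  # expectation under simple random sampling of n_t clusters
    zx = sum(zi * xi for zi, xi in zip(z, x))
    return zx / m_t - (sum(x) - zx) / (sum(m) - m_t)


# Hypothetical data: 6 clusters, 3 treated; x holds cluster totals, m cluster sizes.
z = [1, 0, 1, 0, 1, 0]
m = [4, 2, 3, 5, 3, 1]
x = [6, 3, 4, 9, 2, 3]
print("d_p =", d_p(z, x, m), " d =", d(z, x, m))

# With equal cluster sizes, condition (B), the two statistics coincide.
m_equal = [3] * 6
print("equal sizes:", d_p(z, x, m_equal), d(z, x, m_equal))
```

Because the denominators of `d` are fixed by the design, `d` is a linear function of the treated-cluster total, which is what the Normal approximation relies on.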

3.2 Simple Randomization of Clusters within 
Blocks, Strata or Matched Sets 

The approach extends to the case of block-randomized designs, and to designs
that result from poststratification or matching. Let there be strata
$b = 1,\ldots,B$, within which simple random samples of $n_{t1},\ldots,n_{tB}$
clusters are selected into the treatment group from $n_1,\ldots,n_B$ clusters
overall, for each $b = 1,\ldots,B$. Let
$\mathbf{Z} = (\mathbf{Z}_1^t,\ldots,\mathbf{Z}_b^t,\ldots,\mathbf{Z}_B^t)^t$,
$\mathbf{Z}_b = (Z_{b1},\ldots,Z_{bn_b})$ for each stratum $b$, be a vector
random variable of which the experimental assignment was a realization, and
let $\mathbf{m} = (\mathbf{m}_1^t,\ldots,\mathbf{m}_B^t)^t$ record sizes of
clusters in terms of observation units. For each $b = 1,\ldots,B$, let
$m_{tb} = \mathrm{E}(\mathbf{Z}_b^t\mathbf{m}_b) = \bar{m}_b n_{tb}$ be the
expected number of observation units in the treatment group. Let
$\mathbf{x} = (\mathbf{x}_1^t,\ldots,\mathbf{x}_B^t)^t$ and
$\mathbf{v} = (\mathbf{v}_1^t,\ldots,\mathbf{v}_B^t)^t$ be single covariates
(perhaps cluster sums of individual measurements).

Because both treatment "propensities" [i.e., $P(Z_{bi} = 1)$, $b = 1,\ldots,B$]
and covariate distributions may vary across blocks, comparisons of simple
means of treatment and control units, even assignment units rather than
measurement units, may fall prey to Simpson's paradox, despite random
assignment (Blyth, 1972). Rather, when averaging across blocks the two means
must be standardized by a common set of block-specific weights; or,
equivalently, treatment and control averages can be taken and compared within
blocks before taking the weighted average of the differences. Within a block
$b$, the (modified) difference of treatment and control group means on
$\mathbf{x}$ is simply
$\mathbf{z}_b^t\mathbf{x}_b/m_{tb} - (\mathbf{1}-\mathbf{z}_b)^t\mathbf{x}_b/(m_b - m_{tb})$,
where $m_b = \mathbf{1}^t\mathbf{m}_b$. Weights may be proportional to the
number of subjects in each block, proportional to the number of
treatment-group subjects in each block, or selected so as to be optimal under
some model; this latter approach is developed in Section 5. For now, fix
positive weights $w_1,\ldots,w_B$ such that $\sum_b w_b = 1$.

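Considered as a random variable, the adjusted difference of treatment and
control group means is

\[
d(\mathbf{Z},\mathbf{x}) = \sum_{b=1}^{B} w_b\bigl[\mathbf{Z}_b^t\mathbf{x}_b/m_{tb} - (\mathbf{1}-\mathbf{Z}_b)^t\mathbf{x}_b/(m_b - m_{tb})\bigr] \tag{3}
\]
\[
\phantom{d(\mathbf{Z},\mathbf{x})} = \sum_{b=1}^{B} w_b\bar{m}_b^{-1}h_b^{-1}\,\mathbf{Z}_b^t\mathbf{x}_b - \sum_{b=1}^{B} w_b\bar{m}_b^{-1}(n_b - n_{tb})^{-1}\,\mathbf{1}^t\mathbf{x}_b, \tag{4}
\]

where $h_b = [n_{tb}^{-1} + (n_b - n_{tb})^{-1}]^{-1} = n_{tb}(1 - n_{tb}/n_b)$
is half the harmonic mean of $n_{tb}$ and $(n_b - n_{tb})$. Within block $b$,
$\mathbf{Z}_b^t\mathbf{x}_b$ is the sample total of a simple random sample of
size $n_{tb}$ from $(x_{b1},\ldots,x_{bn_b})$. It follows that it has mean
$(n_{tb}/n_b)\mathbf{1}^t\mathbf{x}_b = n_{tb}\bar{x}_b$; that its variance is
$h_b s^2(\mathbf{x}_b)$; and that its covariance with
$\mathbf{Z}_b^t\mathbf{v}_b$ is $h_b s(\mathbf{x}_b;\mathbf{v}_b)$, for
$s(\mathbf{x}_b;\mathbf{v}_b) = (\mathbf{x}_b - \bar{x}_b\mathbf{1})^t(\mathbf{v}_b - \bar{v}_b\mathbf{1})/(n_b - 1)$
and $s^2(\mathbf{x}_b) = s(\mathbf{x}_b;\mathbf{x}_b)$. By virtue of the
design, for blocks $b \ne b'$ the treatment-group totals
$\mathbf{Z}_b^t\mathbf{x}_b$ and $\mathbf{Z}_b^t\mathbf{v}_b$ are independent
of $\mathbf{Z}_{b'}^t\mathbf{x}_{b'}$ and $\mathbf{Z}_{b'}^t\mathbf{v}_{b'}$.
Together, these facts entail the following description of the first and second
moments of $d(\mathbf{Z},\mathbf{x})$ and $d(\mathbf{Z},\mathbf{v})$.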

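Proposition 3.1. Suppose that within blocks $b = 1,\ldots,B$, simple random
samples of $n_{tb}$ from $n_b$ clusters are selected for treatment, with the
rest assigned to control. Let $\mathbf{Z}$ indicate sample membership and let
$\mathbf{x}$ and $\mathbf{v}$ denote covariates. For $d(\cdot,\cdot)$ as in
(3), one has

\[
\mathrm{E}(d(\mathbf{Z},\mathbf{x})) = \mathrm{E}(d(\mathbf{Z},\mathbf{v})) = 0,
\]
\[
\mathrm{Var}(d(\mathbf{Z},\mathbf{x})) = \sum_{b=1}^{B}\frac{w_b^2}{h_b\bar{m}_b}\,\frac{s^2(\mathbf{x}_b)}{\bar{m}_b},
\]
\[
\mathrm{Cov}(d(\mathbf{Z},\mathbf{x}),d(\mathbf{Z},\mathbf{v})) = \sum_{b=1}^{B}\frac{w_b^2}{h_b\bar{m}_b}\,\frac{s(\mathbf{x}_b;\mathbf{v}_b)}{\bar{m}_b},
\]

where $h_b = [n_{tb}^{-1} + (n_b - n_{tb})^{-1}]^{-1}$.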
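For a small design the proposition can be checked exactly by listing every admissible assignment. The sketch below does this in Python for a hypothetical two-block design; the data, the weights `w`, and all function names are assumptions made for illustration. It compares the mean and variance of $d(\mathbf{Z},\mathbf{x})$ over the full randomization distribution with the closed-form expressions.

```python
from itertools import combinations, product


def d_stat(z_blocks, x_blocks, m_blocks, n_t, w):
    """Adjusted difference (3), using m_tb = n_tb * (block-mean cluster size)."""
    total = 0.0
    for z, x, m, nt, wb in zip(z_blocks, x_blocks, m_blocks, n_t, w):
        m_t = nt * sum(m) / len(m)
        zx = sum(zi * xi for zi, xi in zip(z, x))
        total += wb * (zx / m_t - (sum(x) - zx) / (sum(m) - m_t))
    return total


def var_formula(x_blocks, m_blocks, n_t, w):
    """Var(d(Z, x)) as stated in Proposition 3.1."""
    v = 0.0
    for x, m, nt, wb in zip(x_blocks, m_blocks, n_t, w):
        n = len(x)
        h = nt * (1 - nt / n)
        m_bar = sum(m) / n
        x_bar = sum(x) / n
        s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
        v += wb ** 2 * s2 / (h * m_bar ** 2)
    return v


def all_assignments(n, nt):
    """Every 0/1 vector of length n with exactly nt ones."""
    for idx in combinations(range(n), nt):
        yield [1 if i in idx else 0 for i in range(n)]


# Hypothetical design: two blocks, cluster totals x, cluster sizes m.
x_blocks = [[3, 1, 4, 2], [5, 2, 2]]
m_blocks = [[2, 1, 2, 1], [3, 2, 1]]
n_t = [2, 1]
w = [0.6, 0.4]

designs = [list(all_assignments(len(x), nt)) for x, nt in zip(x_blocks, n_t)]
values = [d_stat(zs, x_blocks, m_blocks, n_t, w) for zs in product(*designs)]
exact_mean = sum(values) / len(values)
exact_var = sum((v - exact_mean) ** 2 for v in values) / len(values)
print(exact_mean, exact_var, var_formula(x_blocks, m_blocks, n_t, w))
```

Enumeration is feasible only for very small designs; it is used here purely as a check on the algebra.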

When $d(\mathbf{Z},\mathbf{x})$ can be assumed Normal, Proposition 3.1 permits
analysis of its distribution. In fact, relevant central limit theorems do
entail its convergence to the Normal distribution as the size of the sample
increases, and they suggest that the convergence should be fast and uniform
across covariates $\mathbf{x}, \mathbf{v}, \ldots$. There are two cases. In the
first case, the size of each stratum falls under a fixed limit. Since the
sample size is increasing, this means the number of strata tends to infinity.
As each of them makes an independent contribution to the sum that is
$d(\mathbf{Z},\mathbf{x})$, ordinary central limit theorems entail that its
distribution tends to Normal. Indeed, the ordinary Berry-Esseen lemma limits
the difference between the distribution function of $d(\mathbf{Z},\mathbf{x})$
and an appropriate Normal distribution in terms of its variance and its third
central moment (Feller 1971, Chapter 16), both of which are calculable
precisely from the design and from the configuration of $\mathbf{x}$. In the
second case, at least one stratum size tends to infinity. Assume that in each
growing stratum the proportions of clusters assigned to treatment and to
control tend to nonzero constants. Then the contribution
$(h_b\bar{m}_b)^{-1}\mathbf{Z}_b^t\mathbf{x}_b$ from any growing stratum $b$ is
a rescaled sum of a simple random sample from
$(x_{b1}, x_{b2}, \ldots, x_{bn_b})$ and is governed by the central limit
theorem and Berry-Esseen principle for simple random sampling (see
Section 3.1). Contributions from small strata that do not grow are either
asymptotically Normal, by the first argument, or, assuming a nonpathological
weighting scheme, asymptotically negligible, or both; it follows that the
overall sum of stratum contributions tends to Normal.

Although any weighting of blocks is possible, some are more likely to reveal
imbalances than others. Section 5 shows weighting in proportion to the product
of block-mean cluster size and the harmonic mean of $n_{tb}$ and
$n_b - n_{tb}$, $w_b^* \propto h_b\bar{m}_b$, to be optimal in an important
sense. It also so happens that with this weighting, expressions for the first
and second moments of $d(\mathbf{Z},\mathbf{x})$ simplify.

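Corollary 3.1. Suppose that within blocks $b = 1,\ldots,B$, simple random
samples of $n_{tb}$ from $n_b$ clusters are selected for treatment, with the
rest assigned to control. Let $\mathbf{Z}$ indicate sample membership and let
$\mathbf{x}$ and $\mathbf{v}$ denote covariates. For $d(\cdot,\cdot)$ as in
(3), with $w_b = w_b^* \propto h_b\bar{m}_b = \bar{m}_b n_{tb}(1 - n_{tb}/n_b)$,
one has

\[
d(\mathbf{z},\mathbf{x}) = \Bigl(\sum_{b=1}^{B} h_b\bar{m}_b\Bigr)^{-1}\Bigl[\sum_{b=1}^{B}\mathbf{z}_b^t\mathbf{x}_b - \sum_{b=1}^{B} n_{tb}(\mathbf{1}^t\mathbf{x}_b/n_b)\Bigr], \tag{5}
\]
\[
\mathrm{E}(d(\mathbf{Z},\mathbf{x})) = \mathrm{E}(d(\mathbf{Z},\mathbf{v})) = 0,
\]
\[
\mathrm{Var}(d(\mathbf{Z},\mathbf{x})) = \Bigl(\sum_{b=1}^{B} h_b\bar{m}_b\Bigr)^{-2}\sum_{b=1}^{B} h_b\,s^2(\mathbf{x}_b), \tag{6}
\]
\[
\mathrm{Cov}(d(\mathbf{Z},\mathbf{x}),d(\mathbf{Z},\mathbf{v})) = \Bigl(\sum_{b=1}^{B} h_b\bar{m}_b\Bigr)^{-2}\sum_{b=1}^{B} h_b\,s(\mathbf{x}_b;\mathbf{v}_b).
\]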

3.3 Accommodating Independent Assignment by 
Conditioning 

Proposition 3.1 assumes simple random sampling of treatment groups within
blocks. Were assignments within block $b$ made by independent
Bernoulli($p_b$) trials, the induced first and second moments of
$d(\mathbf{Z},\mathbf{x})$, understood as a $w_b$-weighted sum of terms

\[
\frac{\mathbf{Z}_b^t\mathbf{x}_b}{\bar{m}_b\,\mathbf{Z}_b^t\mathbf{1}} - \frac{(\mathbf{1}-\mathbf{Z}_b)^t\mathbf{x}_b}{\bar{m}_b\,(n_b - \mathbf{Z}_b^t\mathbf{1})},
\]

since $n_{tb}$ would no longer be a fixture of the design, would be formally
and numerically similar to those of the proposition, as a simple argument
shows. $\mathbf{Z}_b^t\mathbf{1}$ is $\mathrm{Bin}(n_b, p_b)$, independently of
$\mathbf{Z}_{b'}^t\mathbf{1} \sim \mathrm{Bin}(n_{b'}, p_{b'})$, $b' \ne b$,
and conditionally on $\mathbf{Z}_b^t\mathbf{1} = n_{tb}$ the distribution of
$\mathbf{Z}_b^t\mathbf{x}_b$ is that of a sample sum of a simple random sample
of size $n_{tb}$ from $\{x_{b1},\ldots,x_{bn_b}\}$. In general, conditioning on
$\mathbf{Z}_1^t\mathbf{1},\ldots,\mathbf{Z}_B^t\mathbf{1}$ gives
$d(\mathbf{Z},\mathbf{x})$ and $d(\mathbf{Z},\mathbf{v})$ distributions of the
type described in Proposition 3.1.

Conditional assessments of $d(\mathbf{Z},\mathbf{x})$ have the advantage of
being immune from disruption by unusually small or large allocations
$\mathbf{Z}_b^t\mathbf{1}$ to treatment, $b = 1,\ldots,B$. The sizes of these
allocations carry little relevant information, as a conditionality argument
shows. Consider the broader model in which $P(Z_{bi} = 1)$ is not a constant
for all $i = 1,\ldots,n_b$, but instead
$\mathrm{logit}(P(Z_{bi} = 1)) = \psi_b + \psi_x(x_{bi})$. The null hypothesis
holds that $\psi_x \equiv 0$; a test of balance aims to reject it when
$\psi_x(\cdot)$ is nonnull. The likelihood of the full model, with independent
sampling of the $Z_{bi}$'s and possibly nonzero $\psi_x$, can be
straightforwardly represented as

\[
\exp\Bigl\{\sum_{b=1}^{B}\Bigl[\psi_b\,\mathbf{Z}_b^t\mathbf{1} + \sum_{i=1}^{n_b}\bigl(Z_{bi}\,\psi_x(x_{bi}) - \kappa_{bi}\bigr)\Bigr]\Bigr\}, \tag{7}
\]

$\kappa_{bi} = \log[1 + \exp(\psi_b + \psi_x(x_{bi}))]$, but it can also be
parametrized in terms of the function $\psi_x(\cdot)$ and moment parameters
$\eta_b = \mathrm{E}(\mathbf{Z}_b^t\mathbf{1} \mid \psi_b, \psi_x)$,
$b = 1,\ldots,B$, with $(\eta_1,\ldots,\eta_B)$ and $\psi_x(\cdot)$ being
variation independent (Barndorff-Nielsen and Cox (1994), page 40 ff). The
statistic $(\mathbf{Z}_1^t\mathbf{1},\ldots,\mathbf{Z}_B^t\mathbf{1})$ is
ancillary for inference about the function $\psi_x$; in the main it reflects
on $(\eta_b : b \le B)$, not $\psi_x(\cdot)$.



3.4 Example and Implementation 

The Vote'98 experiment used a factorial design, 
varying the probability of a household's assignment 
to telephone GOTV across levels of the other treat- 
ments it assigned. (Specifically, households eligible 
for telephone calls were also eligible for assignment 
to receive GOTV mailings, and for assignment to 
receive a personal GOTV appeal; the probability 
of assignment to the telephone group varied across 
cells of the mail GOTV by personal GOTV cross- 
classification.) This makes methods for block-ran- 
domized studies a necessity. Consequently, Table 2 
uses modified differences of Section 3.2, as aggre- 
gated using harmonic block weights as in Corol- 
lary 3.1, to combine balance measures across sub- 
classes defined by treatments other than telephone 
GOTV. 

The first row of Table 2 gives results for the test as to whether
$\mathbf{z}^t\mathbf{m}$, the size of the treatment group in measurement
units, differed substantially from $\mathrm{E}(\mathbf{Z}^t\mathbf{m})$, in a
subsample of 100 clusters and in the full sample of some 23,000. The z-scores
$d(\mathbf{z},\mathbf{m})/\sqrt{V(d)}$ (not shown in the table) were 1.186 and
0.226 for the sub- and full samples, respectively, which by Normal tables give
approximate p-values of 0.236 and 0.821. This suggests
$\mathbf{z}^t\mathbf{m}$ was relatively quite close to its null expectation, a
suggestion that gains further support from simulations, which find the mid-p
values to be 0.211 and 0.821, respectively. Having confirmed balance on
cluster sizes, the next row of the table asks about voting in the previous
election. It is not precisely the same in treatment and control groups, either
for the subsample or for the full sample, as indicated by normalized
differences of $d(\mathbf{z},\mathbf{x})/\sqrt{V(d)} = 1.228$ and $-0.853$,
respectively; but the p-values, 0.224 and 0.391, indicate that voting in the
previous election is as similar in the two groups as could be expected from
random assignment, and the Normal approximation locates them with some
accuracy, 0.220 and 0.394.
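The Normal-table p-values quoted above follow from the z-scores by the usual two-sided calculation, $p = 2\{1 - \Phi(|z|)\}$. A minimal Python sketch, using only the standard library (the function name is an illustrative choice), is:

```python
from math import erfc, sqrt


def two_sided_normal_p(z):
    """Two-sided p-value for a z-score under the standard Normal distribution.

    Uses the identity 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    """
    return erfc(abs(z) / sqrt(2))


# z-scores reported in the text for cluster size and for prior voting.
for z in (1.186, 0.226, 1.228, -0.853):
    print(z, round(two_sided_normal_p(z), 3))
```

Since the z-scores in the text are themselves rounded, the last digit of a recomputed p-value may differ slightly from the published one.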

To compute these adjusted baseline differences and their large-sample
reference distributions, the first step, prior to calling any specialized
function, is to aggregate the data to the cluster level, recording cluster
sizes $m_{bi}$ and creating cluster totals $x_{bi}$ from individual
measurements $x_{bi1},\ldots,x_{bim_{bi}}$. R users can then adapt
functionality from either of at least two R packages, Bowers and Hansen's
(2006) RITOOLS or Hothorn et al.'s (2006) COIN, which perform
randomization-based inference without explicit attention to cluster-level
assignment. We give details for RITOOLS, which uses harmonic weights,
$w_b \propto h_b$, by default. Its function xBalance calculates

\[
d_{\mathrm{noclus}}(\mathbf{z},\mathbf{x}) = \Bigl(\sum_{b=1}^{B} h_b\Bigr)^{-1}\Bigl[\sum_{b=1}^{B}\mathbf{z}_b^t\mathbf{x}_b - \sum_{b=1}^{B} n_{tb}(\mathbf{1}^t\mathbf{x}_b/n_b)\Bigr] \tag{8}
\]

and its randomization variance, printing significance stars based on the
corresponding z-score,
$\mathrm{Var}(d_{\mathrm{noclus}}(\mathbf{Z},\mathbf{x}))^{-1/2}d_{\mathrm{noclus}}(\mathbf{z},\mathbf{x})$.
[The z-score itself is not displayed; instead, as a descriptive measure,
xBalance reports a standardized difference in the sense of Section 2.1, namely
$s_p^{-1}(\mathbf{x})\,d_{\mathrm{noclus}}(\mathbf{z},\mathbf{x})$, where
$s_p(\mathbf{x})$ is the pooled s.d. of $\mathbf{x}$ in the sense of the
two-sample t-test comparing treatment to control clusters.] Since (8) differs
from (3) with $w_b \propto h_b\bar{m}_b$ only in that its denominator is
$\sum_{b=1}^{B} h_b$ rather than $\sum_{b=1}^{B} h_b\bar{m}_b$, this z-score
is the same as that which Corollary 3.1 would have given.

With other software, $\mathrm{Var}(d(\mathbf{Z},\mathbf{x}))$ may have to be
calculated explicitly from (6), but a shortcut is available for determining
$d(\mathbf{z},\mathbf{x})$ when $w_b \propto h_b\bar{m}_b$:
$d_{\mathrm{noclus}}(\mathbf{z},\mathbf{x})$, or
$[(\sum_b h_b\bar{m}_b)/(\sum_b h_b)]\,d(\mathbf{z},\mathbf{x})$, coincides
with the ordinary least squares coefficient of $\mathbf{z}$ in the regression
of $\mathbf{x}$ on $\mathbf{z}$ and dummy variables for blocks. Unfortunately,
$\mathrm{Var}(d(\mathbf{Z},\mathbf{x}))$ does not relate in any helpful way to
this coefficient's ordinary least squares standard error. To recover
$d(\mathbf{z},\mathbf{x})$ from the least squares coefficient, $\sum_b h_b$
and $\sum_b h_b\bar{m}_b$ will have to be calculated. However, given that (6)
has to be figured, these calculations pose little additional burden; and they
are the same for each variable $\mathbf{x}$ on which balance is to be checked.
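The calculation just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the RITOOLS implementation: the data are hypothetical, the function names are invented for the example, and the least squares coefficient is obtained by centring $\mathbf{z}$ and $\mathbf{x}$ within blocks, which is equivalent to including block dummies.

```python
from math import sqrt


def block_summary(z, x, m):
    """Per-block ingredients: h_b, mean cluster size, s^2(x_b), centred treated total."""
    n, n_t = len(z), sum(z)
    h = n_t * (1 - n_t / n)
    m_bar = sum(m) / n
    x_bar = sum(x) / n
    s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    num = sum(zi * xi for zi, xi in zip(z, x)) - n_t * x_bar
    return h, m_bar, s2, num


def harmonic_balance(blocks):
    """Return d(z, x), Var(d(Z, x)), d_noclus(z, x) and the z-score of d."""
    parts = [block_summary(z, x, m) for z, x, m in blocks]
    sum_h = sum(h for h, _, _, _ in parts)
    sum_hm = sum(h * mb for h, mb, _, _ in parts)
    num = sum(p[3] for p in parts)
    sum_hs2 = sum(h * s2 for h, _, s2, _ in parts)
    d = num / sum_hm                # equation (5)
    var_d = sum_hs2 / sum_hm ** 2   # equation (6)
    d_noclus = num / sum_h          # equation (8)
    return d, var_d, d_noclus, d / sqrt(var_d)


def ols_within_blocks(blocks):
    """OLS coefficient of z in the regression of x on z and block dummies."""
    sxz = szz = 0.0
    for z, x, _ in blocks:
        z_bar, x_bar = sum(z) / len(z), sum(x) / len(x)
        sxz += sum((zi - z_bar) * (xi - x_bar) for zi, xi in zip(z, x))
        szz += sum((zi - z_bar) ** 2 for zi in z)
    return sxz / szz


# Hypothetical cluster-level data: (z, cluster totals x, cluster sizes m) per block.
blocks = [([1, 0, 1, 0], [3, 1, 4, 2], [2, 1, 2, 1]),
          ([1, 0, 0], [5, 2, 2], [3, 2, 1])]
d, var_d, d_noclus, z_score = harmonic_balance(blocks)
print(d, var_d, z_score, d_noclus, ols_within_blocks(blocks))
```

Because $d$ and $d_{\mathrm{noclus}}$ differ only by a constant factor, either one yields the same z-score once divided by its own randomization standard deviation.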

4. SIMULTANEOUSLY TESTING BALANCE 
ON MULTIPLE x'S 

Ordinarily there will be several, perhaps many, x-variables along which
balance ought to be checked, and a method of combining baseline differences
will be needed. To this end, write

\[
d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k) := [d(\mathbf{z},\mathbf{x}_1),\ldots,d(\mathbf{z},\mathbf{x}_k)]
\left\{\mathrm{Cov}\begin{pmatrix} d(\mathbf{Z},\mathbf{x}_1)\\ \vdots\\ d(\mathbf{Z},\mathbf{x}_k)\end{pmatrix}\right\}^{-}
\begin{bmatrix} d(\mathbf{z},\mathbf{x}_1)\\ \vdots\\ d(\mathbf{z},\mathbf{x}_k)\end{bmatrix}, \tag{9}
\]

where $\mathrm{Cov}(d(\mathbf{Z},\mathbf{x}_i), d(\mathbf{Z},\mathbf{x}_j))$
is as in Proposition 3.1 and $M^{-}$ denotes a generalized inverse of $M$.
This test has the desirable properties that: (i) it culminates in a single
test statistic and p-value; (ii) its law is roughly $\chi^2$, as a consequence
of $d(\mathbf{Z},\mathbf{x}_1),\ldots,d(\mathbf{Z},\mathbf{x}_k)$ being
approximately Normal; and (iii) it appraises balance not only on
$\mathbf{x}_1,\ldots,\mathbf{x}_k$, but also on all linear combinations of
them. Large imbalances on the linear predictor of a response variable from
$\mathbf{x}_1,\ldots,\mathbf{x}_k$, for example, will make
$d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$ large relative to its null
distribution. The test is a first cousin of Hotelling's (1931) T-test, which
treats $\mathbf{x}_1,\ldots,\mathbf{x}_k$ rather than $\mathbf{z}$ as random
and is F-distributed, rather than $\chi^2$-distributed, under the null of
equivalence between groups.

Linearity of $d(\mathbf{z},\cdot)$ immediately establishes (iii). Arguments of
Sections 3.1 and 3.2 entail that
$d(\mathbf{Z},\beta_1\mathbf{x}_1 + \cdots + \beta_k\mathbf{x}_k)$, suitably
scaled, must be asymptotically $N(0,1)$ provided the $\mathbf{x}_i$'s are
suitably regular, whatever $\beta_1,\ldots,\beta_k$ may be. It follows that
the vector $[d(\mathbf{Z},\mathbf{x}_1),\ldots,d(\mathbf{Z},\mathbf{x}_k)]$
has a multivariate Normal distribution in large samples, showing (ii). Then
$d^2(\mathbf{Z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$ is scalar-valued with a
large-sample $\chi^2$ distribution on
$\mathrm{rank}(\mathrm{Cov}([d(\mathbf{Z},\mathbf{x}_1),\ldots,d(\mathbf{Z},\mathbf{x}_k)]))$
degrees of freedom.

To calculate $d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$, one begins as
if calculating each of $\{d(\mathbf{z};\mathbf{x}_i) : i = 1,\ldots,k\}$
separately (Section 3.4). With RITOOLS, the xBalance function can calculate
each of these simultaneously; in this case, it optionally returns
$d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$ and its corresponding
degrees of freedom. Without this aid, the joint calculation differs from a
sequence of univariate balance assessments only in requiring that covariance
matrices, rather than scalars, be scaled and summed across blocks $b$, and
requiring the rank and a generalized inverse of the resulting sum.
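As an illustration of the joint calculation, the following Python sketch computes $d^2$ for $k = 2$ covariates under harmonic weights. It assumes a nonsingular $2 \times 2$ covariance matrix, so that an ordinary inverse can stand in for the generalized inverse; the data and function names are hypothetical. Under harmonic weights the common factor $(\sum_b h_b\bar{m}_b)^{-1}$ cancels from the quadratic form, so cluster sizes do not enter.

```python
from math import exp


def centred(v):
    """Subtract the block mean from each element."""
    mean = sum(v) / len(v)
    return [vi - mean for vi in v]


def d2_two_covariates(blocks):
    """d^2 for two covariates; blocks is a list of (z, x1, x2) per block.

    Assumes harmonic weights and a nonsingular 2x2 covariance matrix.
    """
    num1 = num2 = 0.0
    c11 = c12 = c22 = 0.0
    for z, x1, x2 in blocks:
        n, n_t = len(z), sum(z)
        h = n_t * (1 - n_t / n)
        a, b = centred(x1), centred(x2)
        # z_b'x_b - n_tb * mean(x_b) is the treated-cluster sum of centred values.
        num1 += sum(zi * ai for zi, ai in zip(z, a))
        num2 += sum(zi * bi for zi, bi in zip(z, b))
        # Accumulate h_b * S_b, with S_b the within-block covariance matrix.
        c11 += h * sum(ai * ai for ai in a) / (n - 1)
        c12 += h * sum(ai * bi for ai, bi in zip(a, b)) / (n - 1)
        c22 += h * sum(bi * bi for bi in b) / (n - 1)
    det = c11 * c22 - c12 ** 2
    # Quadratic form num' C^{-1} num, written out with the explicit 2x2 inverse.
    return (c22 * num1 ** 2 - 2 * c12 * num1 * num2 + c11 * num2 ** 2) / det


blocks = [([1, 0, 1, 0], [3, 1, 4, 2], [1, 0, 0, 1]),
          ([1, 0, 0], [5, 2, 2], [0, 1, 2])]
d2 = d2_two_covariates(blocks)
# On 2 degrees of freedom the chi-square upper tail is exp(-d2 / 2).
p_value = exp(-d2 / 2)

# Replacing x2 by x1 + x2 leaves d^2 unchanged, reflecting property (iii).
recombined = [(z, x1, [u + v for u, v in zip(x1, x2)]) for z, x1, x2 in blocks]
print(d2, p_value, d2_two_covariates(recombined))
```

For more covariates, or a rank-deficient covariance matrix, a linear algebra library providing a pseudoinverse and matrix rank would be needed in place of the explicit inverse.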

The $\chi^2$-approximation seems to work reasonably well even in small
samples. Its distribution in one small simulation experiment is graphed in the
right panel of Figure 1, while Table 3 summarizes its distribution in another;
in both cases it tends somewhat toward conservatism. As a practical tool for
the data analyst, it has the important advantage that it stably handles
saturation with x-variables; one would not bring about a spurious rejection of
the hypothesis of balance by adding to the list of x-variables to be tested.
One certainly would decrease the test's power to detect imbalance along
individual $\mathbf{x}_i$'s included among covariates tested, but that is to
be expected. (An example is given in Section 6.2.) This is in important
contrast with methods based on regression of $\mathbf{z}$ on $\mathbf{x}$'s;
as the left panel of Figure 1 shows, natural tendencies toward overfitting
inflate the Type I errors of such tests.

5. OPTIMIZING LOCAL POWER 

This section develops and analyzes a statistical model of the absence of
balance that is appropriate to randomization inference. Casting this model as
an alternative to the null hypothesis of balance, tests based on $d$ or $d^2$
are seen to have greatest power when weights $w_b^* \propto h_b\bar{m}_b$ are
used to combine differences across blocks or matched sets. Readers not seeking
justification of this may prefer to skip to Section 6.

Say balance is to be assessed against a canonical model (Section 3.2) with $B$
blocks, perhaps after conditioning as in Section 3.3. What choice of weights
$w_1,\ldots,w_B$ maximizes the power of the test for balance? Common results
give the answer for models positing that $\mathbf{x}$ is sampled while
$\mathbf{z}$ is held fixed. Kalton (1968), for instance, assumes random
sampling from $2B$ superpopulations with means
$\mu_{t1}, \mu_{c1}, \ldots, \mu_{tB}, \mu_{cB}$. He finds that in order to
maximize power against alternatives to the effect that
$\mu_{tb} = \mu_{cb} + \delta$, $\delta \ne 0$, blocks' differences of means
should be weighted in proportion to the inverse of the variance of those
differences. With the simplifying assumption of a common variance in the $2B$
superpopulations, this leads to harmonic mean weighting,
$w_b \propto h_b\bar{m}_b$. To avoid this simplification, weights might be set
in proportion to reciprocals of estimated variances. But such a procedure
would seem to add complexity, and to detract from the credibility of
assessments of statistical significance, since the sample-to-sample
fluctuation it imposes on the weighting scheme is difficult to account for at
the stage of analysis (Yudkin and Moher (2001), page 347).

The randomization perspective leads to the same result, but by a cleaner
route, avoiding the need to estimate or make assumptions about dispersion in
superpopulations. In support of this claim, we analyze the problem of
distinguishing unbiased from biased sampling of treatment assignment
configurations, $\mathbf{z}$'s, from A, rather than differences in
superpopulations from which treatment and control $\mathbf{x}$'s are supposed
to be drawn. This amounts to distinguishing constant from nonconstant
$\psi_x(\cdot)$ in model (7).

Our analysis is asymptotic, assuming increasing sample size. Since any
nontrivial test would have overwhelming power given a limitless stock of
similarly informative observations, we mount an analysis of local power, in
which the observations become less informative as sample size increases. This
is modeled with $x$'s that cluster increasingly around a single value as their
number increases, while bias in assignment to treatment is dictated by the
same $\psi_x$. The strata may increase in number or in size, or in both, as
the number of assignment units increases; it is assumed that cluster size is
bounded and that the fractions of blocks allocated to treatment,
$n_{tb}/n_b$, are bounded away from 0 and 1. Because the observations are
neither independent (due to conditioning on $\mathbf{Z}_b^t\mathbf{1}$,
$b = 1,\ldots,B$) nor identically distributed, the asymptotic analysis
pertains not to a single sequence of observations but to a sequence of
experimental populations $\nu = 1, 2, \ldots$ containing increasing numbers of
observations.

Conditions A1-A4, stated in the Appendix, entail certain convergences of
weights and variances, at least along subsequences $\{\nu\}$ of populations.
Specifically, with $\{\nu\}$ narrowed to such a subsequence there are positive
constants $K$, $s_{0x}$, $s_{wx}$ and $v_{wx}$ such that as
$\nu \to \infty$,



n 



(10) 



^^^m^bKtb 
b 



K and 



Wub— 



m„b 



s2 ■ 



n 



u^Wub 



S^(x,.fe) 2 



and 



m^b 



where $d(\mathbf{Z}_\nu,\mathbf{x}_\nu)$ in (11) is understood in the sense
of (12).



Proposition 5.1. Let

\[
d(\mathbf{Z}_\nu,\mathbf{x}_\nu) = \sum_b w_{\nu b}\left[\frac{\mathbf{Z}_{\nu b}^t\mathbf{x}_{\nu b}}{\bar{m}_{\nu b}n_{\nu tb}} - \frac{(\mathbf{1}-\mathbf{Z}_{\nu b})^t\mathbf{x}_{\nu b}}{\bar{m}_{\nu b}(n_{\nu b} - n_{\nu tb})}\right]. \tag{12}
\]

Assume conditions A1-A4, write $P$ and $Q$ for distributions of
$\mathbf{Z}_\nu$ under, respectively, the null of unbiased assignment and the
alternative of bias according to (7) with nonconstant $\psi$, and let
$s_{wx}$, $v_{wx}$ be as in (11). Then

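\[
P_Q\bigl(d(\mathbf{Z}_\nu,\mathbf{x}_\nu) > z^{*}\,\mathrm{Var}_P(d(\mathbf{Z}_\nu,\mathbf{x}_\nu))^{1/2}\bigr) \tag{13}
\]

where $\beta$ is the derivative of $\psi$ at $c$ (as defined in
condition A3).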

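For a proof, see the Appendix.

Compare (13) to

\[
P_P\bigl(d(\mathbf{Z}_\nu,\mathbf{x}_\nu) > z^{*}\,\mathrm{Var}_P(d(\mathbf{Z}_\nu,\mathbf{x}_\nu))^{1/2}\bigr) \to 1 - \Phi(z^{*}),
\]

a statement of the asymptotic normality of $d(\mathbf{Z}_\nu,\mathbf{x}_\nu)$
under the null hypothesis: in the limit, the amount by which power exceeds
size increases with the ratio $s_{wx}^2/v_{wx}$. Specifically, if the
acceptance region is limited from above at
$z_u\mathrm{Var}_P(d(\mathbf{Z}_\nu,\mathbf{x}_\nu))^{1/2}$, $z_u > 0$, then
power against alternatives with $\beta > 0$ is optimized by calibrating the
stratum weights $(w_{\nu b})$ so as to maximize
$\mathrm{Var}_P(d(\mathbf{Z}_\nu,\mathbf{x}_\nu))^{-1/2}\bigl(\sum_b w_{\nu b}s^2(\mathbf{x}_{\nu b})/\bar{m}_{\nu b}\bigr)$,
the limit of which is $s_{wx}^2/v_{wx}$. (If the acceptance region has a
finite lower limit, then a symmetric argument yields that the same calibration
maximizes power against alternatives with $\beta < 0$.) To effect this
calibration, for the moment fix $\nu$ and write $s_b^2$ for
$s^2(\mathbf{x}_{\nu b})/\bar{m}_{\nu b}$ and $w_b$ for $w_{\nu b}$,
$b = 1,\ldots,B$. Recall that $h_b = n_{tb}(1 - n_{tb}/n_b)$ (Section 3.2).
Then

\[
\begin{aligned}
\frac{\sum_b w_{\nu b}s^2(\mathbf{x}_{\nu b})/\bar{m}_{\nu b}}{\mathrm{Var}_P(d(\mathbf{Z}_\nu,\mathbf{x}_\nu))^{1/2}}
 &= \frac{\sum_b w_b s_b^2}{\bigl(\sum_b w_b^2[h_b\bar{m}_b]^{-1}s_b^2\bigr)^{1/2}} \\
 &= \frac{(w_b[h_b\bar{m}_b]^{-1/2}s_b : b \le B)^t}{\|(w_b[h_b\bar{m}_b]^{-1/2}s_b : b \le B)\|_2}\cdot([h_b\bar{m}_b]^{1/2}s_b : b \le B),
\end{aligned} \tag{14}
\]

where $\|\cdot\|_2$ is the Euclidean norm,
$\|\mathbf{x}\|_2 = (\sum_i x_i^2)^{1/2}$. Selecting $(w_b : b = 1,\ldots,B)$
so as to maximize this expression amounts to maximizing the correlation
between $B$-dimensional vectors
$(w_b[h_b\bar{m}_b]^{-1/2}s_b : b = 1,\ldots,B)$ and
$([h_b\bar{m}_b]^{1/2}s_b : b = 1,\ldots,B)$, which is achieved by setting
$w_b \propto h_b\bar{m}_b$, that is, by $w_{\nu b} = w_{\nu b}^*$.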
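The maximization is an instance of the Cauchy-Schwarz inequality, and it can be checked numerically. The Python sketch below uses hypothetical block constants, invented function names, and randomly drawn competing weights; it confirms that no competing weight vector yields a larger ratio than weights proportional to $h_b\bar{m}_b$.

```python
from math import sqrt
import random


def power_ratio(w, h, m_bar, s2):
    """(sum_b w_b s^2(x_b) / m_bar_b) / Var_P(d)^{1/2}.

    Here s2[b] is s^2(x_b), and Var_P(d) = sum_b w_b^2 s^2(x_b) / (h_b m_bar_b^2),
    as in Proposition 3.1.
    """
    num = sum(wb * s / mb for wb, s, mb in zip(w, s2, m_bar))
    var = sum(wb ** 2 * s / (hb * mb ** 2)
              for wb, s, hb, mb in zip(w, s2, h, m_bar))
    return num / sqrt(var)


def normalise(w):
    """Rescale positive weights to sum to 1."""
    total = sum(w)
    return [wb / total for wb in w]


# Hypothetical block constants: h_b, mean cluster sizes, and s^2(x_b).
h = [1.0, 2.0 / 3.0, 2.4]
m_bar = [1.5, 2.0, 3.0]
s2 = [1.7, 3.0, 0.8]

w_star = normalise([hb * mb for hb, mb in zip(h, m_bar)])
best = power_ratio(w_star, h, m_bar, s2)

rng = random.Random(0)  # fixed seed so the comparison is reproducible
others = [power_ratio(normalise([rng.random() + 0.01 for _ in h]), h, m_bar, s2)
          for _ in range(1000)]
print(best, max(others))
```

At the optimum the ratio equals $(\sum_b h_b s^2(\mathbf{x}_b))^{1/2}$, the Euclidean norm of the second vector in (14).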

6. APPLICATIONS TO STUDY DESIGN AND 
ANALYSIS 

6.1 Whether to Stratify, and Which 
Stratification is Best 

Randomization within well-chosen blocks may lead to imbalances on baseline
measures of smaller absolute magnitude than unrestricted randomization, and
smaller baseline imbalances are preferable for various reasons. Raab and
Butcher (2001) sought to avoid imbalances large enough to create noticeable
discrepancies between treatment effects estimated with and without covariance
adjustment. Such differences might be troubling to the policymakers who were a
central audience for their study, even if they fell within estimated standard
errors. Yudkin and Moher (2001) worry that designs in which sizable imbalances
are possible may sacrifice power.

To head off these problems, Yudkin and Moher's 
ASSIST team elected to randomize clinics within three 
blocks, consisting of 6, 9 and 6 clinics, rather than 
to randomly assign treatment to 14 of 21 clinics 
outright. It remained to be decided which baseline 
variable to block on. They report deciding against 
blocking on clinic size after finding only weak cor- 
relations between clinics' sizes and baseline rates of 
adequate heart disease assessments; they feared that 
privileging size in the formation of blocks could have 
"resulted in imbalance in the main prognostic fac- 
tor" (Yudkin and Moher (2001), page 345). While 
these correlations are certainly reasonable to consider, it might have been
more direct to compare candidate blocking schemes on the basis of the variance
in $d(\mathbf{Z},\cdot)$'s they would entail, preferring those schemes that
offer lesser mean-square imbalances on key prognostic variables.

Table 4 offers such a comparison. It emerges that, 
despite the weak relationship between clinic size and 
baseline rate of adequate assessment, blocking on 
size balances the rate of adequate assessment quite 
well, nearly as well as does blocking on the rate it- 
self. Meanwhile, to balance other baseline variables, 
rates of treatment with various drugs that at follow- 
up would be measured as secondary outcomes, it is 
much better to block on size. [Lewsey (2004) discusses size blocking in some
detail.] Perhaps the investigators were too quick to reject this option. In
any case, the comparison of $\mathrm{Var}(d(\mathbf{Z},\mathbf{x}))$, from
(6), for various blocking schemes and covariates, $\mathbf{x}$, would more
directly have informed their decision.
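A comparison of this kind needs only baseline data and the candidate block structure, since (6) does not depend on the realized assignment. The Python sketch below evaluates (6) for an unblocked design and for one blocked alternative; the covariate values, cluster sizes and function name are hypothetical and are not the ASSIST data.

```python
def var_d(blocks):
    """Var(d(Z, x)) from (6) under harmonic weights.

    blocks: list of (x, m, n_t), giving cluster totals x, cluster sizes m,
    and the number n_t of clusters to be treated in that block.
    """
    sum_hm = sum_hs2 = 0.0
    for x, m, n_t in blocks:
        n = len(x)
        h = n_t * (1 - n_t / n)
        x_bar = sum(x) / n
        s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
        sum_hm += h * sum(m) / n
        sum_hs2 += h * s2
    return sum_hs2 / sum_hm ** 2


# Hypothetical baseline covariate on 8 single-unit clusters, 4 to be treated.
x = [1, 2, 3, 4, 11, 12, 13, 14]
m = [1] * 8

unblocked = var_d([(x, m, 4)])
# Candidate scheme: two blocks formed from low and high values, 2 treated in each.
blocked = var_d([(x[:4], m[:4], 2), (x[4:], m[4:], 2)])
print(unblocked ** 0.5, blocked ** 0.5)
```

Repeating the calculation for each candidate scheme and each key covariate gives a table of standard deviations of the same form as Table 4.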

6.2 Whether to Poststratify, and Whether a 
Given Poststratification Suffices 

Comparative studies typically present a small num- 
ber of covariates that must be balanced in order for 
the study to be convincing, along with a longer list of 
variables on which balance would be advantageous. 



Table 4
Standard deviations of $d(\mathbf{Z},\mathbf{x})$ under various stratification schemes, expressed as fractions of an s.d. of $\mathbf{x}/\bar{m}$

                                               Baseline variable
                                  Adequate     Treatment with
Stratification                    assessment   aspirin   hypotensives   lipid-reducers
None                              0.46         0.46      0.46           0.46
By rate of adequate assessment    0.31         0.42      0.43           0.36
By clinic size                    0.33         0.24      0.24           0.31

Both stratification schemes offer distinctly better expected balance than no stratification at all, and stratification on clinic
size seems preferable to stratification on clinics' baseline rates of adequate assessment.



In the ASSIST trial, the short list consists of baseline measures on
variables to be used as outcomes; in the Vote'98 experiment, it comprises a
"baseline" measure of the outcome, voting in the previous election, along with
party membership and demographic data that predict voting. Were treatment
subjects appreciably older, and so perhaps more likely to vote (Highton and
Wolfinger, 2001) than controls, or were they more likely to have voted in past
elections, then one would suspect appreciable positive error in unadjusted
estimates of the treatment effect, even in the presence of randomization,
which controls such errors most of the time but not all of the time.

Even if discovered only after treatments have been applied, such imbalances
can be remedied by poststratification: if treatments are on the whole older
than controls, for example, then compare older treatments only to older
controls, and also compare younger subjects only among themselves. There is
the possibility that one could introduce imbalances on other variables by
subclassifying on age; to assess this, one might apply
$d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$, where
$\mathbf{x}_1,\ldots,\mathbf{x}_k$ make up the short list, to the
poststratified design. Should subclassifying only on age fail to sufficiently
reduce $d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$, or should there be
a more complex pattern of misalignment to begin with, propensity-score methods
are a reliable alternative (Rosenbaum and Rubin (1984), and Hill et al.,
2000). Indeed, with the option of propensity score subclassification, there is
little reason to restrict one's attention entirely to the short list; one can
reasonably hope to relieve gross imbalances on any of a longer list of
covariates, as well as smaller imbalances on the most important ones.

Perhaps with this in mind, Imai (2005) suggests checking the Vote'98 data for
imbalance twice, once focusing on short-list variables and a second time
considering also second-order interactions of them. As discussed by Arceneaux
et al. (2004), and as the discussion of Section 2 would predict, his
logistic-regression based check gives misleading results. Despite this
technical impediment, however, the spirit of the suggestion is sound; one
might hope the check based on $d^2$ would perform more reliably. In fact it
does: in 10^ simulated reassignments of telephone GOTV, the $d^2$ statistic
combining imbalances on all first- and second-order interactions of
x-variables exceeded nominal 0.001, 0.01, 0.05 and 0.10 levels of the
$\chi^2(363)$ distribution in 0.09%, 0.9%, 4.8%, and 9.7% of trials,
respectively. The treatment assignment actually used gives, for the long list,
$d^2 = 360.6$, with theoretical and simulation p-values 0.526 and 0.527,
respectively, and for the short list, $d^2 = 26.6$ on 38 d.f.'s, with p-values
0.918 and 0.918, respectively; it is well balanced.

7. SUMMARY AND DISCUSSION 

Clinical trials methodologists note, with some alarm, 
how few cluster randomized trials explicitly make 
note of cluster-level assignment and account for it in 
the analysis (Divine et al. (1992); MacLennan et al.
(2003); Isaakidis and Ioannidis (2003)). We have seen
the need for such an accounting even when it seems 
least necessary, with clusters that are small, uniform 
in size, and numerous. We have also seen that one 
natural model-based test for balance along covari- 
ates, the test based on logistic regression, is prone 
to spuriously indicate lack of balance when there 
are too many covariates relative to observations, and 
that this condition obtains for surprisingly large ra- 
tios of observations to the number of covariates. 

Cluster-level randomization is said to confront investigators with "a
bewildering array of possible approaches to the data analysis" (Donner and
Klar, 1994). Randomization inference presents a less cluttered field of
options, and has the additional advantages of adaptation specifically to
comparative studies and of being nonparametric. With appropriate attention to
the form of the test statistic, it is quite possible in the randomization
framework to respect the study's design while training attention on
differences among individuals. This aim also suggests conditioning strategies
appropriate to the problem of assessing covariate balance. The result is a
class of test statistics that one can expediently appraise using Normal
approximations which are quite accurate in small and moderate samples. The
tests gauge balance on a single covariate or on a set of covariates jointly;
in the latter case, they also implicitly assess imbalance on linear
combinations of the covariates, including projections of the response variable
into covariate space. Our analysis of a model of biased assignment suggests
values for tuning parameters that completely specify, indeed simplify, the
form of the resulting nonparametric tests, ending with a simple prescription
that is suitable for general use: assess balance along individual covariates
$\mathbf{x}$ with the differences $d(\mathbf{z},\mathbf{x})$ between treatment
and control groups' adjusted means, using weights $w_b \propto \bar{m}_b h_b$
to combine across blocks as in (5); then mount an overall test by referring
the combined baseline difference
$d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$, as in (9), to the
appropriate $\chi^2$-distribution.

As an omnibus measure of balance, the combined baseline difference statistic
$d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$ is similar in form and
spirit to a statistic suggested by Raab and Butcher (2001), namely a weighted
sum of squares of differences of means of cluster means:
$\alpha_1 d(\mathbf{z},\mathbf{x}_1/\mathbf{m})^2 + \cdots + \alpha_k d(\mathbf{z},\mathbf{x}_k/\mathbf{m})^2$,
where $\alpha_1,\ldots,\alpha_k > 0$ sum to 1. The ability of the statistician
to decide the relative weightings $\alpha$ of the variables might in some
contexts be an advantage, but in others it may be burdensome. In all cases it
lends some arbitrariness to the criterion. Also, the criterion directly
measures only imbalances in $\mathbf{x}_1,\ldots,\mathbf{x}_k$. In contrast,
$d^2(\mathbf{z};\mathbf{x}_1,\ldots,\mathbf{x}_k)$ measures imbalances in
linear combinations of $\mathbf{x}_1,\ldots,\mathbf{x}_k$ as much as in these
variables themselves, lets the data drive the weighting scheme, upweighting
discrepancies along variables with less variation in general, and has the
advantage of easy calibration against tables.

Altman (1985), Begg (1990) and Senn (1994) crit- 
icize the use of balance tests to decide which covari- 
ates to adjust for in the outcome analysis of a clini- 
cal trial, arguing that these judgments should rather 
be made on the basis of the prognostic value of the 
covariate. These criticisms are sometimes taken to
support the stronger conclusion that balance tests
are inappropriate for any purpose. The criticisms 
do not, however, speak against the use of balance 
tests to detect problems of implementation, nor do 
they preclude a possible role for assessments of bal- 
ance in the interpretation of study results (Begg, 
1990). Indeed, the CONSORT statement on reporting 
in clinical trials (Begg et al., 1996) recommends that 
reports include assessments of balance on variables 
of possible prognostic value. 

Section 5 established optimality of tests based on 
$d$ and $d^2$ within one class of balance criteria and
under certain conditions, but in some settings other 
statistics may be better equipped to reveal biased 
assignment. For instance, in some clinical trials that 
enroll patients sequentially and at the discretion of 
their physicians it is possible for the physician to 
guess or infer the treatment to which a potential pa- 
tient would be assigned; the methods of 
Berger and Exner (1999) and Berger (2005) model 
patterns of assignment that would occur if physi- 
cians were using this foreknowledge to the advan- 
tage of one assignment arm or the other, and may 
have greater power in such situations. 

When subclassifying or matching on the propensity
score, systematic appraisals of balance are needed
to check and tune the propensity adjustment 
(Rosenbaum and Rubin, 1984). An exact propensity 
stratification would make an observational study as 
well-balanced as if its treatment conditions had been 
assigned randomly within the strata, but the in- 
evitably more crude propensity stratifications that 
are available in practice may yield less balance. Balance
tests based on $d^2$ are particularly well-suited
to adjudicate the success or failure of a given inex- 
act propensity model and stratification procedure. 
In contrast with the case of randomized assignment, 
propensity adjustment inevitably leaves at least some 
within-stratum variation in probabilities of assign- 
ment to treatment, making it certain that the hy- 
pothesis of unbiased allocation is false, at least in 
detail (Hansen, 2008). One hopes, however, that the 
bias is sufficiently small so as not to imbalance co- 
variates discernibly more than random assignment 
would be expected to have done, and this is precisely 
the question that $d^2$ addresses. With its focus on
the randomization distribution, it avoids modeling 
treatment and control observations as having been 
sampled from respective superpopulations, an unde- 
sirable feature of many other balance tests 
(Imai et al., 2008). Another advantage of $d$ and $d^2$
for observational studies is that they apply without
modification to matched data, by treating the
matched sets as strata; likelihood ratio tests of lo- 
gistic regression models, by contrast, are not con- 
sistent when used in this way (Agresti (2002), Sec- 
tion 6.3.4). 

APPENDIX: PROOF OF PROPOSITION 5.1 

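Let there be constants $\{x_{\nu b i}\}$, $\{m_{\nu b i}\}$, and random
indicator variables $\{Z_{\nu b i}\}$, arranged in triangular arrays
the $\nu$th rows of which contain $n_\nu$ entries,
$\mathbf{x}_\nu = (\mathbf{x}_{\nu 1}^t, \ldots, \mathbf{x}_{\nu B_\nu}^t)^t$, $\mathbf{m}_\nu = (\mathbf{m}_{\nu 1}^t, \ldots, \mathbf{m}_{\nu B_\nu}^t)^t$ and
$\mathbf{Z}_\nu = (\mathbf{Z}_{\nu 1}^t, \ldots, \mathbf{Z}_{\nu B_\nu}^t)^t$, respectively, where
$\mathbf{x}_{\nu b} = (x_{\nu b 1}, \ldots, x_{\nu b n_{\nu b}})^t$, $\mathbf{m}_{\nu b} = (m_{\nu b 1}, \ldots, m_{\nu b n_{\nu b}})^t$ and
$\mathbf{Z}_{\nu b} = (Z_{\nu b 1}, \ldots, Z_{\nu b n_{\nu b}})^t$, for some whole numbers $B_\nu$ and
$n_{\nu 1}, \ldots, n_{\nu B_\nu}$. Within a given row $\nu$, $\mathbf{x}_{\nu b}$, $\mathbf{m}_{\nu b}$, and $\mathbf{Z}_{\nu b}$
describe cluster totals on a variable $x$, cluster sizes
(averaging to $\bar{m}_{\nu b}$) and treatment assignments within
block $b$, any $b \le B_\nu$. Suppose $1 \le n_{\nu t b} < n_{\nu b}$ for all
$\nu$, $b$, and assume of the random variables $\mathbf{Z}_{\nu b}$ that
with probability 1, $\mathbf{Z}_{\nu b}^t \mathbf{1} = n_{\nu t b}$ for each $\nu$ and
$b \le B_\nu$; say vectors $\mathbf{z}_\nu$ that lack this property are
excluded. The null hypothesis asserts that for each
$\nu$ and block $b \le B_\nu$, $P(Z_{\nu b i} = 1)$ is the same for
all indices $i$. Alternately put, its probability density
$P(\mathbf{z}_\nu)$ vanishes for excluded $\mathbf{z}_\nu$ and otherwise
is proportional to (7) with $\psi_x \equiv 0$. For alternatives
$Q$ to this null, define (for nonexcluded $\mathbf{z}_\nu$) a
likelihood proportional to (7), with bias function $\psi_x(\cdot)$
the same for all $\nu$. Assume of this sequence of models
that: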

A1 $\{m_{\nu b i}\}$ is uniformly bounded, and $\{n_{\nu t b}/n_{\nu b}\}$ is
uniformly bounded away from 0 and 1;

A2 weights $w_{\nu b}$ have the property that $w_{\nu b}/w^*_{\nu b}$ is
uniformly bounded away from 0 and $\infty$, where
$w^*_{\nu b} \propto \bar{m}_{\nu b}\, n_{\nu t b}(1 - n_{\nu t b}/n_{\nu b})$ and $\sum_b w^*_{\nu b} = 1$;

A3 for some $c$, $\sup_{b,i} |x_{\nu b i} - c| \downarrow 0$, and
$\sum_{b \le B_\nu} \sum_i (x_{\nu b i} - c)^2$ is $O(1)$ as $\nu \uparrow \infty$;

A4 $\psi_x$ is differentiable at $c$, where $c$ is the constant
referred to in A3.

Condition A1 has the side-effect of limiting the
divergence of $w^*_\nu$ and other common weighting schemes;
should weights $w_{\nu b}$ be proportional to the number of
subjects in a block, the number of treatment group
subjects in a block, or the total of controls by block,
then by condition A1, $w_{\nu b}/w^*_{\nu b}$ will be universally
bounded away from 0 and $\infty$. In other words, given
A1, condition A2 is not restrictive. Condition A1
also ensures that $\sum_b \bar{m}_{\nu b} h_{\nu b}$ is $O(n_\nu)$. Condition A3
ensures tightening dispersion of $x$'s around $c$. In
particular, combined with A1, condition A3 entails
that $\sum_b w^{*2}_{\nu b} s^2(\mathbf{x}_{\nu b}) / (h_{\nu b} \bar{m}^2_{\nu b})$ is $O(n_\nu^{-2})$, or that with
weighting by either $w^*_\nu$ or $w_\nu$, the weighted average of
block mean differences $d(\mathbf{Z}_\nu, \mathbf{x}_\nu)$ has variance of
order $O(n_\nu^{-2})$; see Proposition 3.1 and Corollary 3.1.

We establish Proposition 5.1 using principles of
contiguity (Le Cam (1960); Hajek and Sidak (1967)),
which describe the limiting $Q$-distribution of a test
statistic $t(\mathbf{Z})$ in terms of the limiting joint
distribution, under $P$, of $(t(\mathbf{Z}), \log \frac{dQ}{dP}(\mathbf{Z}))$. A technical
lemma, Lemma A.1, is needed, after which contiguity
results are invoked to establish Lemma A.2 (from
which the proposition is immediate).

Lemma A.1. Under the hypotheses of Proposition 5.1,

$$\log \frac{dQ}{dP}(\mathbf{Z}_\nu) \Rightarrow_P N\left(-\tfrac{1}{2}\beta^2 K s^2_{\infty x},\ \beta^2 K s^2_{\infty x}\right)$$

(where $\Rightarrow_P$ denotes convergence in distribution under $P$).

Lemma A.2. Under the hypotheses of Proposition 5.1,

A.1 Proof of Lemma A.1

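Without loss of generality, the $c$ named in conditions
A3 and A4 is 0. Then one has $\psi_x(x) = \psi_x'(0)x + o(|x|) = \beta x + x\varepsilon(x)$,
where, because of condition A3,
$\max_{b,i} |\varepsilon(x_{\nu b i})| \downarrow 0$ as $\nu \uparrow \infty$. Since $Q$ is defined by (7)
and $P$ is defined by (7) without the $\psi_x$ term, one can
write

$$\log \frac{dQ}{dP}(\mathbf{Z}_\nu) = \beta \sum_b \left\{\mathbf{Z}_{\nu b}^t \mathbf{x}_{\nu b} - E_P(\mathbf{Z}_{\nu b}^t \mathbf{x}_{\nu b})\right\} + \sum_b \left\{\mathbf{Z}_{\nu b}^t (\mathbf{x}_{\nu b}\varepsilon(\mathbf{x}_{\nu b})) - E_P(\mathbf{Z}_{\nu b}^t (\mathbf{x}_{\nu b}\varepsilon(\mathbf{x}_{\nu b})))\right\} + \kappa_{\nu P} - \kappa_{\nu Q}$$

$$=: X_\nu + Y_\nu - (\kappa_{\nu Q} - \kappa_{\nu P}), \tag{A.1}$$

with $X_\nu$ and $Y_\nu$ denoting the first and second sums, for appropriate constants $\kappa_{\nu P}$, $\kappa_{\nu Q}$.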

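By calculations similar to those justifying Proposition 3.1,
$\mathrm{Var}_P(X_\nu) = \beta^2 \sum_b h_{\nu b} s^2(\mathbf{x}_{\nu b})$. By condition
A3 and (10), this variance approaches $\beta^2 K s^2_{\infty x}$.
By the discussion following (3), $\mathrm{Var}_P(Y_\nu) = \sum_b h_{\nu b} s^2(\mathbf{x}_{\nu b}\varepsilon(\mathbf{x}_{\nu b}))$.
By condition A3, this is $O(\varepsilon_\nu^2)$ as $\nu \uparrow \infty$,
where $\varepsilon_\nu := \sup_{b,i} |\varepsilon(x_{\nu b i})|$. By conditions A3 and
A4, of course, $\varepsilon_\nu \downarrow 0$ as $\nu \uparrow \infty$; thus $\mathrm{Var}_P(Y_\nu) \downarrow 0$ as
$\nu \uparrow \infty$. Since (as we have seen) $\mathrm{Var}_P(X_\nu)$ is $O(1)$, it
follows also that $\mathrm{Cov}_P(X_\nu, Y_\nu) = O(\varepsilon_\nu)$, and overall
$\mathrm{Var}_P(X_\nu + Y_\nu) \to \beta^2 K s^2_{\infty x}$ as $\nu \uparrow \infty$.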

Clearly both $X_\nu$ and $Y_\nu$ have expectation 0 under
$P$. Since the random term $X_\nu + Y_\nu$ is, as in Section 3.2,
a sum of totals of simple random samples,
its limiting law (under $P$) must be $N(0, \beta^2 K s^2_{\infty x})$.

It remains to be shown that $\kappa_{\nu Q} - \kappa_{\nu P} \to \tfrac{1}{2}\beta^2 K s^2_{\infty x}$.
Since $E_P((dQ/dP)(\mathbf{Z})) = 1$, $\exp\{\kappa_{\nu Q} - \kappa_{\nu P}\} = E_P(e^{X_\nu + Y_\nu})$.
From what was just shown it follows
immediately that $e^{X_\nu + Y_\nu} \Rightarrow_P e^{N(0, \beta^2 K s^2_{\infty x})}$, the
expectation of which equals the moment generating
function of the standard Normal distribution evaluated
at $\beta K^{1/2} s_{\infty x}$, or $\exp\{\tfrac{1}{2}\beta^2 K s^2_{\infty x}\}$. So the conclusion
follows if we can establish that $E_P(e^{X_\nu + Y_\nu})$ converges
to $E(e^{N(0, \beta^2 K s^2_{\infty x})})$. This would follow from uniform
integrability of the random variables $e^{X_\nu + Y_\nu}$, which
would follow in turn from $\sup_\nu E_P(e^{(1+\epsilon)(X_\nu + Y_\nu)}) < \infty$,
any $\epsilon > 0$.

The rest of the argument verifies this by establishing
the technical condition that
$\limsup_\nu E_P(\exp\{\sqrt{2}(X_\nu + Y_\nu)\}) < \infty$. We make use
of a theorem of Hoeffding (1963), to the effect that
the expectation of a convex continuous function of
a sum of a simple random sample is bounded above
by the expectation of the same function of a similarly
sized with-replacement sample from the same
population, and of the fact from calculus that if
for a triangular array $\{c_{ij}\}$ of nonnegative numbers,
$\max_j c_{ij} \downarrow 0$ while $\sum_j c_{ij} \to \lambda$, then $\prod_j (1 + c_{ij}) \to e^\lambda$.
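
Both auxiliary facts can be checked numerically on small cases. The sketch below is only an illustration under assumed inputs: the population values, the sample size `k` and the choice $t = \sqrt{2}$ are hypothetical, and exhaustive enumeration stands in for the expectations in Hoeffding's theorem, with the exponential as the convex function.

```python
import math
from itertools import combinations, product

def mgf_without_replacement(population, k, t):
    """E exp(t * sum) over all simple random samples of size k."""
    sums = [sum(c) for c in combinations(population, k)]
    return sum(math.exp(t * s) for s in sums) / len(sums)

def mgf_with_replacement(population, k, t):
    """E exp(t * sum) over all ordered with-replacement samples of size k."""
    sums = [sum(c) for c in product(population, repeat=k)]
    return sum(math.exp(t * s) for s in sums) / len(sums)

def product_of_one_plus(terms):
    """Product of (1 + c) over a row of a triangular array."""
    return math.prod(1 + c for c in terms)

# A small centered population (values are illustrative only).
population = [-0.3, -0.1, 0.0, 0.15, 0.25]
t = math.sqrt(2)
srs = mgf_without_replacement(population, 3, t)
iid = mgf_with_replacement(population, 3, t)

# Row n of a triangular array with equal entries lam / n: max -> 0, sum = lam.
lam, n = 2.0, 100_000
row_product = product_of_one_plus([lam / n] * n)
```

Here `srs` does not exceed `iid`, as the theorem requires, and `iid` equals the third power of the single-draw moment generating function, which is the form of bound exploited in the argument. `row_product` is close to $e^{\lambda}$ for large `n`.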



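Write $m_{\nu b}(t)$ for the moment generating function
of $\mathbf{Z}_{\nu b}^t \psi_x(\mathbf{x}_{\nu b}) - n_{\nu t b}\overline{\psi_x(x)}_{\nu b}$, so that
$E_P(e^{t(X_\nu + Y_\nu)}) = \prod_b m_{\nu b}(t)$. Under $P$, $\mathbf{Z}_{\nu b}^t \psi_x(\mathbf{x}_{\nu b})$ is the
sum of a simple random sample of size $n_{\nu t b}$, and
by Hoeffding's theorem $m_{\nu b}(t) \le (\tilde{m}_{\nu b}(t))^{n_{\nu t b}}$, where
$\tilde{m}_{\nu b}(t)$ is the moment generating function of a single
draw, $D_{\nu b}$, from $\{\psi_x(x_{\nu b i}) - \overline{\psi_x(x)}_{\nu b} : i \le n_{\nu b}\}$. By
Taylor approximation, for each $\nu$ and $b$,
$\tilde{m}_{\nu b}(\sqrt{2}) = 1 + E_P(D_{\nu b}^2 \exp\{t^*_{\nu b} D_{\nu b}\})$, some $t^*_{\nu b} \in [0, \sqrt{2}]$. We now
need to show that $\max_b E_P(D_{\nu b}^2 \exp\{t^*_{\nu b} D_{\nu b}\}) \downarrow 0$, and that
$\sum_b n_{\nu t b} E_P(D_{\nu b}^2 \exp\{t^*_{\nu b} D_{\nu b}\})$ is $O(1)$. By condition
A3, as $\nu$ increases $D_{\nu b}^2 \exp\{t^*_{\nu b} D_{\nu b}\}$ is deterministically
bounded by constants tending to 0, entailing
$\max_b E_P(D_{\nu b}^2 \exp\{t^*_{\nu b} D_{\nu b}\}) \downarrow 0$. $\exp\{t^*_{\nu b} D_{\nu b}\}$ also
declines to 1 deterministically, so that the sum of
$n_{\nu t b} E_P(D_{\nu b}^2 \exp\{t^*_{\nu b} D_{\nu b}\})$ is $O(1)$ if $\sum_b n_{\nu t b} E_P(D_{\nu b}^2) = \sum_b n_{\nu t b} \sigma^2(\psi_x(\mathbf{x}_{\nu b}))$ is.
Now $\sum_b n_{\nu t b} \sigma^2(\psi_x(\mathbf{x}_{\nu b})) = \sum_b n_{\nu t b} \beta^2 \sigma^2(\mathbf{x}_{\nu b}) + \sum_b n_{\nu t b} \sigma^2(\psi_x(\mathbf{x}_{\nu b}) - \beta \mathbf{x}_{\nu b}) + 2\sum_b n_{\nu t b} \beta\, \sigma(\mathbf{x}_{\nu b}, \psi_x(\mathbf{x}_{\nu b}) - \beta \mathbf{x}_{\nu b})$.
Invoking condition
A3, the first of these three sums may be seen to be
$O(1)$, and the latter two $O(\varepsilon_\nu^2)$ and $O(\varepsilon_\nu)$, respectively,
as $\nu \uparrow \infty$. It follows that $\prod_b (\tilde{m}_{\nu b}(\sqrt{2}))^{n_{\nu t b}}$,
and hence $\prod_b m_{\nu b}(\sqrt{2})$, are $O(1)$, confirming that
$\{e^{X_\nu + Y_\nu} : \nu = 1, \ldots\}$ is uniformly integrable.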
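
The Taylor step invoked in this argument can be spelled out as follows. The display is a sketch under the stated setup, writing $\tilde{m}(t) = E(e^{tD})$ for the moment generating function of a single bounded draw $D$ with $E(D) = 0$; the unsubscripted symbols are shorthand introduced here for illustration.

```latex
\begin{align*}
\tilde{m}(t) &= \tilde{m}(0) + t\,\tilde{m}'(0) + \tfrac{1}{2}t^2\,\tilde{m}''(t^*)
   && \text{for some } t^* \in [0, t], \\
\tilde{m}(0) &= 1, \qquad \tilde{m}'(0) = E(D) = 0, \qquad
\tilde{m}''(t^*) = E\!\left(D^2 e^{t^* D}\right), \\
\tilde{m}(\sqrt{2}) &= 1 + \tfrac{1}{2}(\sqrt{2})^2\, E\!\left(D^2 e^{t^* D}\right)
   = 1 + E\!\left(D^2 e^{t^* D}\right).
\end{align*}
```

The choice $t = \sqrt{2}$ makes the factor $t^2/2$ equal to 1, so the remainder enters with unit coefficient, and the product of the bounds then has the form $\prod (1 + c)$ to which the calculus fact about triangular arrays applies.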

A.2 Proof of Lemma A.2

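Write $T_\nu := \mathrm{Var}_P(d(\mathbf{Z}_\nu, \mathbf{x}_\nu))^{-1/2} d(\mathbf{Z}_\nu, \mathbf{x}_\nu)$. By
arguments of Section 3.2, $T_\nu \Rightarrow_P N(0, 1)$. Combining
this with Lemma A.1, one has that

$$\left(T_\nu,\ \log \frac{dQ}{dP}(\mathbf{Z}_\nu)\right)^t \Rightarrow_P N\left(\begin{pmatrix} 0 \\ -\sigma^2/2 \end{pmatrix},\ \begin{pmatrix} 1 & \tau \\ \tau & \sigma^2 \end{pmatrix}\right),$$

where $\sigma^2 = \beta^2 K s^2_{\infty x}$ is the limiting variance of Lemma A.1,
for some as yet to be determined $\tau$. This establishes
the premise of Le Cam's Third Lemma (Le Cam
(1960); Hajek and Sidak (1967)), the conclusion of
which is that the limit law under $Q$ of the random
variable $T_\nu$ is $N(\tau, 1)$. We now calculate $\tau$.
Using the notation of (A.1), $\mathrm{Cov}(T_\nu, \log \frac{dQ}{dP}(\mathbf{Z}_\nu)) = \mathrm{Cov}(T_\nu, X_\nu) + \mathrm{Cov}(T_\nu, Y_\nu)$.
Now $|\mathrm{Cov}(T_\nu, Y_\nu)| \le (\mathrm{Var}(T_\nu)\,\mathrm{Var}(Y_\nu))^{1/2} = \mathrm{Var}(Y_\nu)^{1/2}$, which was shown
in the proof of Lemma A.1 to decline to 0 as $\nu$
increases. Considering only nonexcluded treatment
assignments $\mathbf{Z}_\nu$,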



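$$\mathrm{Cov}_P(T_\nu, X_\nu) = V^{-1/2}\, \mathrm{Cov}_P\left(\sum_b \frac{w_{\nu b}}{h_{\nu b}\bar{m}_{\nu b}} \mathbf{Z}_{\nu b}^t \mathbf{x}_{\nu b},\ \sum_b \beta\, \mathbf{Z}_{\nu b}^t \mathbf{x}_{\nu b}\right)$$

$$= V^{-1/2} \beta \sum_b \frac{w_{\nu b}}{h_{\nu b}\bar{m}_{\nu b}} \mathrm{Var}_P(\mathbf{Z}_{\nu b}^t \mathbf{x}_{\nu b})$$

$$= \beta V^{-1/2} \sum_b w_{\nu b}\, s^2(\mathbf{x}_{\nu b}) / \bar{m}_{\nu b},$$

writing $V$ for $\mathrm{Var}_P(d(\mathbf{Z}_\nu, \mathbf{x}_\nu))$, invoking independence
of $\mathbf{Z}_b$ and $\mathbf{Z}_{b'}$, $b \neq b'$, and evaluating $\mathrm{Var}_P(\mathbf{Z}_b^t \mathbf{x}_b)$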
in the same manner as led to Proposition 3.1. Ac- 
cording to (11), then, CovpiTy.Xy)^ (3^. It fol- 

lows that r = P- 



s2 

wx 